Kafka Direct Stream - dynamic topic subscription

2017-10-27 Thread Ramanan, Buvana (Nokia - US/Murray Hill)
Hello, Using Spark 2.2.0. Interested in seeing dynamic topic subscription in action. Tried this example: streaming.DirectKafkaWordCount (which uses org.apache.spark.streaming.kafka010). I start with 8 Kafka partitions in my topic and found that Spark Streaming executes 8 tasks (one per par

Spark 2.2.0 GC Overhead Limit Exceeded and OOM errors in the executors

2017-10-27 Thread Supun Nakandala
Hi all, I am trying to do some image analytics type workload using Spark. The images are read in JPEG format and then are converted to the raw format in map functions and this causes the size of the partitions to grow by an order of 1. In addition to this, I am caching some of the data because my

StringIndexer on several columns in a DataFrame with Scala

2017-10-27 Thread Md. Rezaul Karim
Hi All, There are several categorical columns in my dataset as follows: [image: Inline images 1] How can I transform the values in each (categorical) column into numerics using StringIndexer so that the resulting DataFrame can be fed into VectorAssembler to generate a feature vector? A naive approa
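StringIndexer fits a single column at a time, so the usual multi-column approach is to build one indexer per categorical column and chain the stages in a Pipeline. The mapping each indexer learns — distinct values indexed by descending frequency — can be sketched without Spark (tie-breaking here is alphabetical; Spark's exact tie-breaking may differ):

```scala
// Sketch of the per-column mapping a StringIndexer builds:
// distinct values ordered by descending frequency get indices 0, 1, 2, ...
object IndexerSketch {
  def fitLabels(column: Seq[String]): Map[String, Double] = {
    val ordered = column
      .groupBy(identity)                  // value -> all occurrences
      .toSeq
      .sortBy { case (v, occ) => (-occ.size, v) }  // most frequent first
      .map(_._1)
    ordered.zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap
  }
}
```

In Spark itself the equivalent would be a loop producing one `new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")` stage per column, fed to a `Pipeline` (column names hypothetical).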

Re: Structured Stream in Spark

2017-10-27 Thread KhajaAsmath Mohammed
Yes, I checked both the output location and the console too. It doesn't have any data. The link also has the code and the question that I have raised with Azure HDInsight. https://github.com/Azure/spark-eventhubs/issues/195 On Fri, Oct 27, 2017 at 3:22 PM, Shixiong(Ryan) Zhu wrote: > The codes in the link

Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Jean Georges Perrin
May I ask what the use case is? It is a very interesting question, but I would be concerned about going further than a proof of concept. A lot of the enterprises I see and visit are barely on Java 8, so starting to talk JDK 9 might be slight overkill, but if you have a good story, I’m a

Re: Structured Stream in Spark

2017-10-27 Thread Shixiong(Ryan) Zhu
The code in the link writes the data into files. Did you check the output location? By the way, if you want to see the data on the console, you can use the console sink by changing this line *format("parquet").option("path", outputPath + "/ETL").partitionBy("creationTime").start()* to *format("con

Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Sean Owen
Certainly, Scala 2.12 support precedes Java 9 support. A lot of the work is in place already, and the last issue is dealing with how Scala closures are now implemented quite differently with lambdas / invokedynamic. This affects the ClosureCleaner. For the interested, this is, as far as I know, the mai

Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Jörn Franke
Scala 2.12 is not yet supported on Spark, which also means no JDK 9: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-14220 If you look at Oracle's support model, JDK 9 is anyway only supported for 6 months. JDK 8 is LTS (5 years). JDK 18.3 will be only 6 months and JDK 18.9 is l

Re: Structured Stream in Spark

2017-10-27 Thread KhajaAsmath Mohammed
Hi TathagataDas, I was trying to use eventhub with spark streaming. Looks like I was able to make connection successfully but cannot see any data on the console. Not sure if eventhub is supported or not. https://github.com/Azure/spark-eventhubs/blob/master/examples/src/main/scala/com/microsoft/sp

Re: Structured streaming with event hubs

2017-10-27 Thread KhajaAsmath Mohammed
I was looking at this example but didn't get any output from it when used. https://github.com/Azure/spark-eventhubs/blob/master/examples/src/main/scala/com/microsoft/spark/sql/examples/EventHubsStructuredStreamingExample.scala On Fri, Oct 27, 2017 at 9:18 AM, ayan guha wrote: > Does event hub

Re: Structured streaming with event hubs

2017-10-27 Thread ayan guha
Does event hub support structured streaming at all yet? On Fri, 27 Oct 2017 at 1:43 pm, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > Could anyone share if there is any code snippet on how to use spark > structured streaming with event hubs ?? > > Thanks, > Asmath > > Sent from

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Thakrar, Jayesh
What you have is sequential code, and hence sequential processing. Also, Spark/Scala are not parallel programming languages; but even if they were, statements are executed sequentially unless you exploit the parallel/concurrent execution features. Anyway, see if this works: val (RDD1, RDD2) = (JavaFunc
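One way to exploit those concurrent execution features from the driver is to submit each Spark action from its own thread, e.g. with Scala Futures. A minimal sketch of the pattern, with plain computations standing in for the Spark actions (which would otherwise need a running cluster):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Two independent "jobs" submitted from separate threads, so the driver
// does not serialize them. With real RDDs the Future bodies would be
// actions such as RDD1.count() and RDD2.count().
object ParallelJobs {
  def runBoth(): (Int, Int) = {
    val job1 = Future { (1 to 100).sum }  // stand-in for one Spark action
    val job2 = Future { (1 to 50).sum }   // stand-in for another
    val combined = for (a <- job1; b <- job2) yield (a, b)
    Await.result(combined, 10.seconds)
  }
}
```

With real Spark jobs, whether the two actions truly overlap also depends on available executor slots and the scheduler mode (FIFO vs. FAIR).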

Re: Orc predicate pushdown with Spark Sql

2017-10-27 Thread Siva Gudavalli
I found a workaround: when I create the Hive table using Spark “saveAsTable”, I see filters being pushed down. The other approaches I tried, where filters are not pushed down, are: 1) when I create the Hive table upfront and load ORC into it using Spark SQL, 2) when I create orc files using spark SQL and t
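One setting worth checking in this situation: in Spark 2.x, ORC filter pushdown is disabled by default and must be turned on explicitly, e.g. in spark-defaults.conf (or the equivalent `spark.conf.set("spark.sql.orc.filterPushdown", "true")` at runtime):

```
spark.sql.orc.filterPushdown  true
```

Without this, predicates may not reach the ORC reader regardless of how the table was created.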

cosine similarity between rows

2017-10-27 Thread Donni Khan
I have a Spark job to compute the similarity between text documents:
RowMatrix rowMatrix = new RowMatrix(vectorsRDD.rdd());
CoordinateMatrix rowsimilarity = rowMatrix.columnSimilarities(0.5);
JavaRDD entries = rowsimilarity.entries().toJavaRDD();
List list = entries.collect();
for (MatrixEntry s : list)
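For reference, the quantity `columnSimilarities` estimates is plain cosine similarity — and note that it computes it between the matrix's columns, not its rows, so row-level document similarity typically requires a transpose or a direct computation. A self-contained sketch of the measure itself:

```scala
// Cosine similarity of two dense vectors: dot(a, b) / (|a| * |b|).
// Returns 0.0 for a zero vector to avoid dividing by zero.
object Cosine {
  def similarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}
```

The threshold argument (0.5 above) makes columnSimilarities use sampling (DIMSUM), so its results approximate these exact values for pairs above the threshold.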