Re: choice of RDD function

2016-06-16 Thread Sivakumaran S
Dear Jacek and Cody, I receive a stream of JSON (exactly four JSON objects) once every 30 seconds from Kafka, as follows (I have changed my data source to include more fields):
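For context, an illustrative sketch of one such object, assuming the field names quoted in the original post at the bottom of this thread (id, ht, rotor_rpm); the rpm value and any further fields are made up, since the actual sample is elided from this digest:

// Illustrative only: the real payload carries more fields than shown here.
val sample = """{ "id": 121156, "ht": 42, "rotor_rpm": 1800.0 }"""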

Re: choice of RDD function

2016-06-16 Thread Sivakumaran S
Hi Jacek and Cody, First of all, thanks for helping me out. I started with combineByKey while testing with just one field. It worked fine, of course, but I was worried that the code would become unreadable if there were many fields, which is why I shifted to sqlContext, because the code
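For reference, a hedged sketch of the combineByKey route described here, averaging a single numeric field per key; sc is an existing SparkContext, and the key and field choice are assumptions:

val rpmById = sc.parallelize(Seq((121156, 1800.0), (121156, 1750.0), (121157, 1600.0)))
val avgRpm = rpmById.combineByKey(
  (v: Double) => (v, 1L),                                              // create a (sum, count) accumulator
  (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),       // fold a value into it
  (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2) // merge partial accumulators
).mapValues { case (sum, n) => sum / n }                               // per-key average

Each additional field needs its own slot in the accumulator tuple, which is exactly what makes this style hard to read as the schema grows.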

Re: choice of RDD function

2016-06-16 Thread Jacek Laskowski
Rather:

val df = sqlContext.read.json(rdd)

Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Wed, Jun 15, 2016 at 11:55 PM, Sivakumaran S
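A minimal sketch of where that one-liner sits in a streaming job, assuming lines is a DStream[String] of JSON documents (all names here are illustrative):

import org.apache.spark.sql.SQLContext

lines.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext) // reuse one SQLContext across batches
  val df = sqlContext.read.json(rdd)                        // parse the batch straight into a DataFrame
  df.show()
}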

Re: choice of RDD function

2016-06-16 Thread Jacek Laskowski
Hi, That's one of my concerns with the code. What concerned me the most is that the RDD(s) were converted to DataFrames only to registerTempTable and execute SQL queries. I think it would perform better if DataFrame operators were used instead. I wish I had numbers.

Regards,
Jacek Laskowski
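A hedged sketch of the two styles being contrasted, assuming a DataFrame df with id and rotor_rpm columns (the column names are assumptions):

import org.apache.spark.sql.functions.avg

// Temp-table style: register the DataFrame, then go through the SQL parser.
df.registerTempTable("turbine")
sqlContext.sql("SELECT id, avg(rotor_rpm) AS avg_rpm FROM turbine GROUP BY id").show()

// DataFrame-operator style: same result, no temp table or SQL string.
df.groupBy("id").agg(avg("rotor_rpm").as("avg_rpm")).show()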

Re: choice of RDD function

2016-06-15 Thread Cody Koeninger
Doesn't that result in consuming each RDD twice, in order to infer the JSON schema?

On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S wrote:
> Of course :)
>
> object sparkStreaming {
>   def main(args: Array[String]) {
>     StreamingExamples.setStreamingLogLevels() // Set
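One hedged way to sidestep the double pass raised here is to hand read.json an explicit schema, so Spark never scans the RDD just to infer one; the field names and types below are assumptions:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("ht", LongType),
  StructField("rotor_rpm", DoubleType)))
val df = sqlContext.read.schema(schema).json(rdd) // single pass over the data, no inference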

Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Cody, Are you referring to the val lines = messages.map(_._2)?

Regards,
Siva

> On 15-Jun-2016, at 10:32 PM, Cody Koeninger wrote:
>
> Doesn't that result in consuming each RDD twice, in order to infer the
> JSON schema?
>
> On Wed, Jun 15, 2016 at 11:19 AM,

Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Of course :)

object sparkStreaming {
  def main(args: Array[String]) {
    StreamingExamples.setStreamingLogLevels() // Set reasonable logging levels for streaming if the user has not configured log4j.
    val topics = "test"
    val brokers = "localhost:9092"
    val topicsSet =
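Since the digest cuts the snippet off, here is a hedged reconstruction of where it is presumably heading, patterned on Spark 1.6's direct-Kafka streaming example plus the read.json suggestion from elsewhere in the thread; everything past the lines quoted above (app name, batch interval, the query itself) is an assumption, not the poster's actual code:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object sparkStreaming {
  def main(args: Array[String]) {
    val topics = "test"
    val brokers = "localhost:9092"
    val topicsSet = topics.split(",").toSet
    val ssc = new StreamingContext(new SparkConf().setAppName("sparkStreaming"), Seconds(30))
    val kafkaParams = Map("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)
    val lines = messages.map(_._2) // drop the Kafka key, keep the JSON value
    lines.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      val df = sqlContext.read.json(rdd) // infer the schema from this batch
      df.registerTempTable("turbine")    // assumed table name
      sqlContext.sql("SELECT id, max(rotor_rpm) AS max_rpm FROM turbine GROUP BY id").show()
    }
    ssc.start()
    ssc.awaitTermination()
  }
}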

Re: choice of RDD function

2016-06-15 Thread Jacek Laskowski
Hi, Good to hear! Mind sharing a few snippets of your solution?

Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Wed, Jun 15, 2016 at 5:03 PM, Sivakumaran S

Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Thanks Jacek, job completed!! :) I just used DataFrames and a SQL query. Very clean and functional code.

Siva

> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski wrote:
>
> mapWithState

Re: choice of RDD function

2016-06-15 Thread Jacek Laskowski
Hi, Regarding Q1: yes. See the stateful operators like mapWithState and windows. Regarding Q2: RDDs should be fine (and available out of the box), but I'd give Datasets a try too, since they're one .toDF call away.

Jacek

On 14 Jun 2016 10:29 p.m., "Sivakumaran S" wrote:
Dear friends, I have set up
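A hedged sketch of those stateful operators on a DStream[(Int, Double)] of (turbineId, rotor_rpm) pairs; the stream, key type, and durations are all illustrative:

import org.apache.spark.streaming.{Seconds, State, StateSpec}

// mapWithState: keep a running maximum rpm per turbine across batches.
def trackMax(id: Int, rpm: Option[Double], state: State[Double]): (Int, Double) = {
  val newMax = math.max(rpm.getOrElse(Double.MinValue), state.getOption.getOrElse(Double.MinValue))
  state.update(newMax)
  (id, newMax)
}
val maxSoFar = rpmStream.mapWithState(StateSpec.function(trackMax _))

// Windows: aggregate over the last 2 minutes, sliding every 30 seconds.
val windowedMax = rpmStream.reduceByKeyAndWindow((a: Double, b: Double) => math.max(a, b), Seconds(120), Seconds(30))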

choice of RDD function

2016-06-14 Thread Sivakumaran S
Dear friends, I have set up Kafka 0.9.0.0, Spark 1.6.1 and Scala 2.10. My source periodically sends a JSON string to a topic in Kafka. I am able to consume this topic using Spark Streaming and print it. The schema of the source JSON is as follows: { "id": 121156, "ht": 42, "rotor_rpm":
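A minimal sketch of the consume-and-print step described above, assuming messages is the (key, value) DStream returned by KafkaUtils.createDirectStream:

val lines = messages.map(_._2) // the JSON string is the Kafka message value
lines.print()                  // prints the first few records of each batch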