Re: org.apache.spark.SparkException: Task not serializable

2017-03-10 Thread ??????????
hi Mina, can you paste your new code here please? I meet this issue too but do not get Ankur's idea. Thanks, Robin ---Original--- From: "Mina Aslani" Date: 2017/3/7 05:32:10 To: "Ankur Srivastava"; Cc:

Re: question on Write Ahead Log (Spark Streaming )

2017-03-10 Thread Dibyendu Bhattacharya
Hi, You could also use this receiver: https://github.com/dibbhatt/kafka-spark-consumer It is also part of spark-packages: https://spark-packages.org/package/dibbhatt/kafka-spark-consumer You do not need to enable the WAL with this receiver and can still recover from driver failure with no data loss. You can
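For reference, a minimal sketch of the standard receiver-based WAL setup the original question asks about; the app name, checkpoint path and batch interval below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-example")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // write received blocks to the WAL

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/wal-example") // WAL and recovery metadata live under the checkpoint dir
  // ... define the receiver-based DStream and output operations here ...
  ssc
}

// On driver restart, recover the context from the checkpoint instead of creating a fresh one.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/wal-example", createContext _)
ssc.start()
ssc.awaitTermination()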

How to improve performance of saveAsTextFile()

2017-03-10 Thread Parsian, Mahmoud
How to improve performance of JavaRDD.saveAsTextFile("hdfs://…"). This is taking over 30 minutes on a cluster of 10 nodes. Running Spark on YARN. The JavaRDD has 120 million entries. Thank you, Best regards, Mahmoud
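Two common levers, sketched below under the assumption that the write is I/O-bound; the RDD variable and output path are placeholders. Fewer, larger output files plus compressed text usually cut the write time substantially:

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD

val records: RDD[String] = ??? // the 120 million entries from the question

records
  .coalesce(200) // fewer, larger output files; use repartition(200) instead if the data needs rebalancing
  .saveAsTextFile("hdfs:///output/records", classOf[GzipCodec]) // compressed output reduces HDFS write volume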

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-10 Thread Tathagata Das
That config is not safe. Please do not use it. On Mar 10, 2017 10:03 AM, "shyla deshpande" wrote: > I have a spark streaming application which processes 3 kafka streams and > has 5 output operations. > > Not sure what should be the setting for

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Justin Miller
I've created a ticket here: https://issues.apache.org/jira/browse/SPARK-19888 Thanks, Justin > On Mar 10, 2017, at 1:14 PM, Michael Armbrust wrote: > > If you have a reproduction you should open a JIRA. It would be

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Michael Armbrust
If you have a reproduction you should open a JIRA. It would be great if there is a fix. I'm just saying I know a similar issue does not exist in structured streaming. On Fri, Mar 10, 2017 at 7:46 AM, Justin Miller < justin.mil...@protectwise.com> wrote: > Hi Michael, > > I'm experiencing a

Re: Apparent memory leak involving count

2017-03-10 Thread Facundo Domínguez
> You seem to always generate a new RDD instead of reusing the existing one, so it does not seem surprising that the memory need is growing. Thanks for taking a look. Unfortunately, memory grows regardless of whether just one RDD is used or one per iteration. > This is how it can display the

spark streaming with kafka source, how many concurrent jobs?

2017-03-10 Thread shyla deshpande
I have a spark streaming application which processes 3 kafka streams and has 5 output operations. Not sure what should be the setting for spark.streaming.concurrentJobs. 1. If the concurrentJobs setting is 4 does that mean 2 output operations will be run sequentially? 2. If I had 6 cores what
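For reference, the setting under discussion is a plain SparkConf entry; per Tathagata Das's reply earlier in this digest, raising it is not considered safe:

val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.concurrentJobs", "1") // default is 1: a batch's output operations run one at a time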

Re: can spark take advantage of ordered data?

2017-03-10 Thread Jonathan Coveney
While I was at Two Sigma I ended up implementing something similar to what Koert described... you can check it out here: https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/rdd/OrderedRDD.scala. They've built a lot more on top of this (including support for dataframes

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Justin Miller
Hi Michael, I'm experiencing a similar issue. Will this not be fixed in Spark Streaming? Best, Justin > On Mar 10, 2017, at 8:34 AM, Michael Armbrust wrote: > > One option here would be to try Structured Streaming. We've added an option > "failOnDataLoss" that will

Re: can spark take advantage of ordered data?

2017-03-10 Thread Yong Zhang
I think it is an interesting requirement, but I am not familiar enough with Spark to say whether it can be done with the latest Spark version or not. From my understanding, you are looking for some API from Spark to read the source directly into a ShuffledRDD, which indeed needs (K, V and a

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Michael Armbrust
One option here would be to try Structured Streaming. We've added an option "failOnDataLoss" that will cause Spark to just skip ahead when this exception is encountered (it's off by default though so you don't silently miss data). On Fri, Mar 18, 2016 at 4:16 AM, Ramkumar Venkataraman <
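A minimal sketch of the option Michael mentions, assuming Spark 2.x with the spark-sql-kafka-0-10 source, an existing SparkSession named spark, and placeholder broker/topic names:

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("failOnDataLoss", "false") // skip ahead past missing offsets instead of failing the query
  .load()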

Re: can spark take advantage of ordered data?

2017-03-10 Thread Koert Kuipers
This shouldn't be too hard. Adding something to spark-sorted or to the dataframe/dataset logical plan that says "trust me, I am already partitioned and sorted" seems doable. However, you most likely need a custom hash partitioner, and you have to be careful to read the data in without file
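A minimal sketch of the kind of custom hash partitioner mentioned here (the class name is made up); it mirrors Hadoop's default hashCode-based scheme so Spark's view of the layout matches how the files were written upstream:

import org.apache.spark.Partitioner

class HadoopStylePartitioner(val numPartitions: Int) extends Partitioner {
  // same formula as Hadoop's HashPartitioner: non-negative hash modulo partition count
  override def getPartition(key: Any): Int =
    (key.hashCode & Integer.MAX_VALUE) % numPartitions

  override def equals(other: Any): Boolean = other match {
    case p: HadoopStylePartitioner => p.numPartitions == numPartitions
    case _                         => false
  }
  override def hashCode: Int = numPartitions
}

As noted above, the harder part is convincing Spark that freshly read files already carry this partitioner without reshuffling them; that needs custom read logic rather than a plain textFile followed by partitionBy.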

Re: can spark take advantage of ordered data?

2017-03-10 Thread sourabh chaki
My use case is also quite similar. I have 2 feeds, one 3TB and another 100GB. Both feeds are generated by a Hadoop reduce operation and partitioned by the Hadoop HashPartitioner. The 3TB feed has 10K partitions whereas the 100GB file has 200 partitions. Now when I do a join between these two feeds using
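The shape of that join, sketched with placeholder types; partitionBy puts both feeds on a common partitioner so the join itself adds no extra shuffle stage, although the partitionBy calls still shuffle, which is exactly the cost this thread is trying to avoid when the files are already hash-partitioned on disk:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

val big: RDD[(String, String)]   = ??? // the 3TB feed, 10K partitions
val small: RDD[(String, String)] = ??? // the 100GB feed, 200 partitions

val partitioner = new HashPartitioner(10000)
val bigPart   = big.partitionBy(partitioner)
val smallPart = small.partitionBy(partitioner)
val joined    = bigPart.join(smallPart) // co-partitioned join: no additional shuffle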

Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-10 Thread vaquar khan
Please read the Spark documentation at least once before asking questions. http://spark.apache.org/docs/latest/streaming-programming-guide.html http://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2015/11/spark-streaming-datanami.png Regards, Vaquar khan On Fri, Mar 10, 2017

Re: Question on Spark's graph libraries

2017-03-10 Thread Md. Rezaul Karim
+1 Regards, *Md. Rezaul Karim*, BSc, MSc PhD Researcher, INSIGHT Centre for Data Analytics National University of Ireland, Galway IDA Business Park, Dangan, Galway, Ireland Web: http://www.reza-analytics.eu/index.html On 10

Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-10 Thread Sean Owen
Kafka and Spark Streaming don't do the same thing. Kafka stores and transports data, Spark Streaming runs computations on a stream of data. Neither is itself a streaming platform in its entirety. It's kind of like asking whether you should build a website using just MySQL, or nginx. > On 9 Mar

Re: Question on Spark's graph libraries

2017-03-10 Thread Robin East
I would love to know the answer to that too. --- Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action

Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-10 Thread Robin East
As Jorn says there is no best. I would start with https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101. This will help you form some meaningful questions about what tools suit which use cases. Most places have a selection of tools such as spark, kafka, flink, storm, flume and so

[Spark Streaming][Spark SQL] Design suggestions needed for sessionization

2017-03-10 Thread Ramkumar Venkataraman
At high-level, I am looking to do sessionization. I want to combine events based on some key, do some transformations and emit data to HDFS. The catch is there are time boundaries, say, I group events in a window of 0.5 hours, based on some timestamp key in the event. Typical event-time windowing
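One possible shape for the fixed-window part of this, assuming Structured Streaming (Spark 2.1+) and a streaming DataFrame named events with userId and eventTime columns; true variable-length sessions would need a stateful operator rather than fixed windows:

import org.apache.spark.sql.functions.{col, window}

val windowed = events
  .withWatermark("eventTime", "1 hour")                            // bound state kept for late events
  .groupBy(col("userId"), window(col("eventTime"), "30 minutes"))  // 0.5-hour event-time windows
  .count()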

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Ramkumar Venkataraman
Nope, but when we migrated to Spark 1.6 we didn't see the errors yet. Not sure if it was fixed between releases or if it's just a weird timing thing that we haven't hit yet in 1.6 as well. On Sat, Mar 4, 2017 at 12:00 AM, nimmi.cv [via Apache Spark User List] <

How can an RDD make its every elements to a new RDD ?

2017-03-10 Thread Mars Xu
Hi users, I'm new to RDD programming. My problem is as in the title: what I do is read a source file through sc, then do a groupByKey to get a new RDD; now I want to do another groupByKey based on every element of the former RDD. For example, my source file is as follows:
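RDDs cannot be nested, so an element cannot become its own RDD; two common workarounds, sketched below with a made-up "outerKey,innerKey,value" line format, are grouping the inner values with plain Scala collections or grouping by a composite key in one pass:

val lines = sc.textFile("hdfs:///input/source.txt")

val parsed = lines.map { line =>
  val Array(outer, inner, value) = line.split(",")
  (outer, (inner, value))
}

// Option 1: one groupByKey, then group each element's values with ordinary collections
val nested = parsed.groupByKey().mapValues(_.groupBy(_._1))

// Option 2: group by the composite key in a single pass (usually cheaper)
val byComposite = parsed
  .map { case (outer, (inner, value)) => ((outer, inner), value) }
  .groupByKey()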

Re: Case class with POJO - encoder issues

2017-03-10 Thread geoHeil
http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset describes the problem. Actually, I have the same problem. Is there a simple way to build such an Encoder which serializes into multiple fields? I would not want to replicate the whole JTS geometry class hierarchy
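The quickest workaround from the linked question is a Kryo (or Java-serialization) encoder, sketched below assuming Spark 2.x and the pre-LocationTech JTS package name; it stores the object as a single binary column, so serializing into multiple typed fields would instead require a UDT or a manual conversion, e.g. to WKT/WKB:

import com.vividsolutions.jts.geom.Geometry
import org.apache.spark.sql.{Encoder, Encoders}

case class Feature(id: Long, geom: Geometry)

// Kryo-based encoder: the whole object is serialized into one binary column
implicit val featureEncoder: Encoder[Feature] = Encoders.kryo[Feature]

// val ds = spark.createDataset(Seq(Feature(1L, someGeometry))) // picks up the implicit encoder above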