Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Gerard Maas
Hi TD, we also struggled with this error for a long while. The recurring scenario is when the job takes longer to compute than the batch interval and a backlog starts to pile up. Hint: if the DStream storage level is set to MEMORY_ONLY_SER and memory runs out, then you will get a 'Cannot
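Gerard's scenario (per-batch processing time exceeding the batch interval) can be illustrated with a minimal, Spark-free simulation. This is an analogy only, not Spark code, and the numbers are hypothetical:

```python
# Minimal sketch (no Spark required): when per-batch processing time
# exceeds the batch interval, unprocessed batches pile up linearly,
# and older cached blocks become candidates for eviction.
def backlog_after(num_batches, batch_interval_s, processing_time_s):
    """Return how many batches are still waiting after num_batches arrive."""
    backlog = 0.0
    for _ in range(num_batches):
        backlog += 1  # a new batch arrives each interval
        # capacity: how many batches can be finished within one interval
        backlog -= min(backlog, batch_interval_s / processing_time_s)
    return backlog

# 60 batches at a 10 s interval, each taking 15 s to process:
print(backlog_after(60, 10, 15))   # backlog keeps growing
# Same stream when processing fits inside the interval:
print(backlog_after(60, 10, 5))    # backlog stays at zero
```

When the backlog grows without bound, blocks for old batches may be dropped from memory before their jobs run, which is consistent with the 'Cannot compute split' symptom discussed in this thread.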

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Bill Jay
Gerard, that is a good observation. However, the strange thing I see is that if I use MEMORY_AND_DISK_SER, the job fails even earlier. In my case, it takes 10 seconds to process each batch of data, while the batch interval is one minute. It fails after 10 hours with the 'cannot compute split' error. Bill On Thu,

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Tathagata Das
If it regularly fails after 8 hours, could you get me the log4j logs? To limit their size, set the default log level to WARN and the level for all classes in the package o.a.s.streaming to DEBUG. Then I can take a look. On Nov 27, 2014 11:01 AM, Bill Jay bill.jaypeter...@gmail.com wrote:
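TD's suggestion maps to a `log4j.properties` along these lines (a sketch for Spark's log4j 1.x setup; the appender name and layout pattern here are illustrative, so adapt them to your deployment):

```properties
# Route everything to the console at WARN to keep log volume manageable
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# ...but keep DEBUG detail for the streaming internals (o.a.s.streaming)
log4j.logger.org.apache.spark.streaming=DEBUG
```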

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Harihar Nahak

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread lalit1303

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread tian zhang
I have found this paper, which seems to answer most of the questions about RDD lifetime: https://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf Tian On Tuesday, November 25, 2014 4:02 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hey Experts, I wanted to understand in

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Just to add one more point: if Spark Streaming knows when an RDD will no longer be used, I believe Spark will not try to retrieve data it no longer needs. However, in practice, I often encounter the 'cannot compute split' error. Based on my understanding, this is because Spark cleared out

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Tathagata Das
Can you elaborate on the usage pattern that leads to 'cannot compute split'? Are you using the RDDs generated by a DStream outside the DStream logic? Something like running interactive Spark jobs (independent of the Spark Streaming ones) on RDDs generated by DStreams? If that is the case, what is

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Hi TD, I am using Spark Streaming to consume data from Kafka, do some aggregation, and ingest the results into RDS. I do use foreachRDD in the program. I am planning to use Spark Streaming in our production pipeline, and it performs well in generating the results. Unfortunately, we plan to have
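The foreachRDD pattern Bill describes (aggregate each micro-batch, then push the results to an external store such as RDS) boils down to a per-batch sink loop. Here is a Spark-free sketch of that flow; the `fake_rds` dict, `save_batch` helper, and sample batches are all hypothetical stand-ins:

```python
from collections import Counter

# Hypothetical sink standing in for an RDS table: key -> running count.
fake_rds = {}

def save_batch(aggregates, sink):
    """Upsert one micro-batch's aggregates into the external store.
    This is the role the body of foreachRDD plays in the real pipeline."""
    for key, count in aggregates.items():
        sink[key] = sink.get(key, 0) + count

# Two simulated micro-batches of Kafka messages:
for batch in (["a", "b", "a"], ["b", "c"]):
    save_batch(Counter(batch), fake_rds)  # what foreachRDD would do per RDD

print(fake_rds)
```

The key design point is that all writes to the external store happen inside the per-batch callback, so each batch's results are persisted before the corresponding RDD is eligible for cleanup.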

Lifecycle of RDD in spark-streaming

2014-11-25 Thread Mukesh Jha
Hey Experts, I wanted to understand in detail the lifecycle of RDDs in a streaming app. From my current understanding: - RDDs get created from the real-time input stream. - Transform functions are applied lazily to an RDD to produce other RDDs. - Actions are
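Mukesh's mental model (lazy transformations, with evaluation triggered only by an action) can be mimicked outside Spark with Python generators. This is an analogy, not the Spark API:

```python
# Analogy for lazy transformations: nothing runs until an "action"
# (here, sum) consumes the pipeline, mirroring how RDD transformations
# are only materialized when an action is invoked.
events = range(10)                           # stand-in for one micro-batch
doubled = (x * 2 for x in events)            # "transformation": lazy, no work yet
evens = (x for x in doubled if x % 4 == 0)   # chained transformation, still lazy
total = sum(evens)                           # "action": the whole chain runs now
print(total)  # 0 + 4 + 8 + 12 + 16 = 40
```

As with RDD lineage, the intermediate stages (`doubled`, `evens`) hold no data of their own; they only describe how to compute it when the action demands it.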

Re: Lifecycle of RDD in spark-streaming

2014-11-25 Thread Mukesh Jha
Any pointers, guys? On Tue, Nov 25, 2014 at 5:32 PM, Mukesh Jha me.mukesh@gmail.com wrote: Hey Experts, I wanted to understand in detail the lifecycle of RDDs in a streaming app. From my current understanding - RDDs get created from the real-time input stream. - Transform(s)