Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tathagata Das
There are different kinds of checkpointing going on. updateStateByKey requires RDD checkpointing which can be enabled only by called sparkContext.setCheckpointDirectory. But that does not enable Spark Streaming driver checkpoints, which is necessary for recovering from driver failures. That is

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Sean Owen
I reached a similar conclusion about checkpointing . It requires your entire computation to be serializable, even all of the 'local' bits. Which makes sense. In my case I do not use checkpointing and it is fine to restart the driver in the case of failure and not try to recover its state. What I

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tobias Pfeiffer
Sean, thanks for your message! On Mon, Feb 23, 2015 at 6:03 PM, Sean Owen so...@cloudera.com wrote: What I haven't investigated is whether you can enable checkpointing for the state in updateStateByKey separately from this mechanism, which is exactly your question. What happens if you set a

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tobias Pfeiffer
Hi, On Tue, Feb 24, 2015 at 4:34 AM, Tathagata Das tathagata.das1...@gmail.com wrote: There are different kinds of checkpointing going on. updateStateByKey requires RDD checkpointing which can be enabled only by called sparkContext.setCheckpointDirectory. But that does not enable Spark

RE: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Shao, Saisai
Aha, you’re right, I did a wrong comparison, the reason might be only for checkpointing :). Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Wednesday, January 28, 2015 10:39 AM To: Shao, Saisai Cc: user Subject: Re: Why must the dstream.foreachRDD(...) parameter

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Matei Zaharia
I believe this is needed for driver recovery in Spark Streaming. If your Spark driver program crashes, Spark Streaming can recover the application by reading the set of DStreams and output operations from a checkpoint file (see

RE: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Shao, Saisai
Hey Tobias, I think one consideration is for checkpoint of DStream which guarantee driver fault tolerance. Also this `foreachFunc` is more like an action function of RDD, thinking of rdd.foreach(func), in which `func` need to be serializable. So maybe I think your way of use it is not a

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Tobias Pfeiffer
Hi, thanks for the answers! On Wed, Jan 28, 2015 at 11:31 AM, Shao, Saisai saisai.s...@intel.com wrote: Also this `foreachFunc` is more like an action function of RDD, thinking of rdd.foreach(func), in which `func` need to be serializable. So maybe I think your way of use it is not a normal