Re: Is HDFS required for Spark streaming?

2015-09-09 Thread N B
Thanks Cody and TD. If we do run with local directories, I suppose the checkpoint operation will write the various partitions of an RDD into their own local dirs (of course). So what's the worst that can happen in case of a node failure? Will the streaming batches continue to process (i.e. does

Re: Is HDFS required for Spark streaming?

2015-09-09 Thread Tathagata Das
Actually, I think it won't work. If you are using some operation that requires RDD checkpointing, then if the checkpoint files cannot be read (because an executor failed), any subsequent operation that needs that state data cannot continue. So all subsequent batches will fail. You could reduce the

Re: Is HDFS required for Spark streaming?

2015-09-08 Thread Tathagata Das
You can use local directories in that case, but it is not recommended and not a well-tested code path (so I have no idea what can happen). On Tue, Sep 8, 2015 at 6:59 AM, Cody Koeninger wrote: > Yes, local directories will be sufficient > > On Sat, Sep 5, 2015 at 10:44 AM, N B

Re: Is HDFS required for Spark streaming?

2015-09-08 Thread Cody Koeninger
Yes, local directories will be sufficient On Sat, Sep 5, 2015 at 10:44 AM, N B wrote: > Hi TD, > > Thanks! > > So our application does turn on checkpoints but we do not recover upon > application restart (we just blow the checkpoint directory away first and > re-create the

Re: Is HDFS required for Spark streaming?

2015-09-05 Thread N B
Hi TD, Thanks! So our application does turn on checkpoints but we do not recover upon application restart (we just blow the checkpoint directory away first and re-create the StreamingContext) as we don't have a real need for that type of recovery. However, because the application does
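The "blow the checkpoint directory away and re-create the StreamingContext" pattern N B describes might look like the sketch below (a non-authoritative illustration; the path, app name, and batch interval are assumptions, not from the thread, and running it requires a Spark installation):

```scala
import java.io.File
import scala.reflect.io.Directory

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FreshStartStreaming {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "/tmp/spark-checkpoints" // assumed path

    // Delete any old checkpoint data first, so we never recover
    // previous state -- a fresh context every restart, instead of
    // StreamingContext.getOrCreate-style recovery.
    new Directory(new File(checkpointDir)).deleteRecursively()

    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("no-recovery-streaming-app")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Checkpointing is still enabled (some stateful operations
    // require it); only restart-time recovery is skipped.
    ssc.checkpoint(checkpointDir)
  }
}
```

The trade-off, as the rest of the thread discusses, is that checkpointing is still active during a run, so checkpoint-file durability still matters for fault tolerance within that run.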

Is HDFS required for Spark streaming?

2015-09-04 Thread N B
Hello, We have a Spark Streaming program that is currently running on a single node in "local[n]" master mode. We currently give it local directories for Spark's own state management etc. The input streams in from network/flume and the output also goes to network/kafka etc., so the process as such
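A single-node setup like the one described could be expressed in `spark-defaults.conf` roughly as follows (a minimal sketch; the parallelism and scratch path are illustrative assumptions, not values from the thread):

```
# spark-defaults.conf -- single-node streaming job, no HDFS
spark.master       local[4]
# Shuffle spills and other scratch data go to this local directory
spark.local.dir    /tmp/spark-scratch
```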

Re: Is HDFS required for Spark streaming?

2015-09-04 Thread Tathagata Das
Shuffle spills will use local disk; HDFS is not needed for those. Spark and Spark Streaming checkpoint info WILL NEED HDFS for fault tolerance, so that state can be recovered even if Spark cluster nodes go down. TD On Fri, Sep 4, 2015 at 2:45 PM, N B wrote: > Hello, > > We have a
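The distinction TD draws can be sketched as follows: scratch space stays local, while the checkpoint directory points at a fault-tolerant store such as HDFS (a hedged sketch; the HDFS URI, host, and batch interval are placeholders, and running it needs a Spark cluster with HDFS access):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fault-tolerant-streaming-app")
      // Shuffle spills use each node's local disk -- no HDFS needed here.
      .set("spark.local.dir", "/tmp/spark-scratch")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Checkpoint data must survive the loss of any single node,
    // so it goes to a replicated filesystem, not a local path.
    ssc.checkpoint("hdfs://namenode:8020/checkpoints/myapp") // assumed URI
  }
}
```

With a local checkpoint path instead, a node failure makes that node's checkpoint files unreadable, which (per TD's later message) causes all subsequent stateful batches to fail.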