Re: Is HDFS required for Spark streaming?
Actually, I think it won't work. If you are using an operation that requires RDD checkpointing, and the checkpoint files cannot be read (because the executor that wrote them failed), then any subsequent operation that needs that state data cannot continue, so all subsequent batches will fail. You can reduce the chances of the state data being lost by replicating the state RDDs: set the state DStream's persistence level to StorageLevel.MEMORY_ONLY_SER_2.

On Wed, Sep 9, 2015 at 9:18 PM, N B wrote:
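The replication suggestion above can be sketched in Scala against the Spark 1.x DStream API. This is a hedged sketch, not code from the thread: `ssc` and `events` are hypothetical names for an existing StreamingContext and a DStream of keyed counts.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds

// Assume `events` is a DStream[(String, Long)] built on StreamingContext `ssc`.
// The inverse-reduce form of reduceByKeyAndWindow maintains window state
// across batches, which is why checkpointing must be enabled.
val windowedCounts = events.reduceByKeyAndWindow(
  _ + _,        // add counts entering the window
  _ - _,        // subtract counts leaving the window (requires checkpointing)
  Seconds(300), // window duration
  Seconds(10)   // slide duration
)

// Replicate the in-memory state RDDs to two executors, so losing a single
// node does not lose the only copy of the window state.
windowedCounts.persist(StorageLevel.MEMORY_ONLY_SER_2)
```

With `MEMORY_ONLY_SER_2`, each state partition is stored serialized on two executors; this only lowers the odds of losing state, it does not make local-disk checkpoints fault-tolerant.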
Re: Is HDFS required for Spark streaming?
Thanks Cody and TD.

If we do run with local directories, I suppose the checkpoint operation will write the various partitions of an RDD into their own local dirs (of course). So what's the worst that can happen in case of a node failure? Will the streaming batches continue to process (i.e., does the lost checkpointed data get recovered or recreated?), or will the entire application start seeing errors from that point onwards?

Thanks
Nikunj

On Tue, Sep 8, 2015 at 11:54 AM, Tathagata Das wrote:
Re: Is HDFS required for Spark streaming?
You can use local directories in that case, but it is not recommended and not a well-tested code path (so I have no idea what can happen).

On Tue, Sep 8, 2015 at 6:59 AM, Cody Koeninger wrote:
Re: Is HDFS required for Spark streaming?
Yes, local directories will be sufficient.

On Sat, Sep 5, 2015 at 10:44 AM, N B wrote:
Re: Is HDFS required for Spark streaming?
Hi TD,

Thanks!

So our application does turn on checkpoints, but we do not recover upon application restart (we just blow the checkpoint directory away first and re-create the StreamingContext), as we don't have a real need for that type of recovery. However, because the application does reduceByKeyAndWindow operations, checkpointing has to be turned on. Do you think this scenario will also only work with HDFS, or will having local directories suffice?

Thanks
Nikunj

On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das wrote:
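The restart pattern described above (delete the old checkpoint directory and build a fresh StreamingContext instead of recovering from it) could look roughly like this in Scala. The directory path and app name are made up for illustration:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/streaming-checkpoints" // hypothetical path

// Always build a fresh context; we never call
// StreamingContext.getOrCreate, because we don't want recovery.
val conf = new SparkConf().setAppName("windowed-counts")
val ssc = new StreamingContext(conf, Seconds(10))

// Blow the old checkpoint directory away first. Checkpointing is enabled
// only because reduceByKeyAndWindow requires it, not for restart recovery.
FileSystem.get(ssc.sparkContext.hadoopConfiguration)
  .delete(new Path(checkpointDir), true)
ssc.checkpoint(checkpointDir)
```

The open question in the thread is what `checkpointDir` may point at once the job runs on multiple machines, which TD answers in the first message above: a local directory still leaves each partition's checkpoint stranded on one node.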
Re: Is HDFS required for Spark streaming?
Shuffle spills will use local disk; HDFS is not needed for those.
Spark and Spark Streaming checkpoint info WILL NEED HDFS for fault-tolerance, so that it can be recovered even if Spark cluster nodes go down.

TD

On Fri, Sep 4, 2015 at 2:45 PM, N B wrote:
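Concretely, the split TD describes is: shuffle spills follow the node-local directories configured via spark.local.dir, while only the checkpoint directory must live on a fault-tolerant store. A minimal sketch, assuming an existing SparkConf `conf` and StreamingContext `ssc`, with hypothetical paths:

```scala
// Shuffle spills and cached blocks stay on each node's local disks:
conf.set("spark.local.dir", "/mnt/spark-local") // hypothetical local path

// Only checkpoint data must survive the loss of a node, so it goes to a
// distributed, fault-tolerant filesystem:
ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints") // hypothetical URI
```

Any HDFS-compatible fault-tolerant filesystem would serve the same role here; the point is that checkpoints, unlike shuffle spills, must be readable from nodes other than the one that wrote them.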
Is HDFS required for Spark streaming?
Hello,

We have a Spark Streaming program that is currently running on a single node in "local[n]" master mode. We currently give it local directories for Spark's own state management etc. The input is streaming from network/flume and the output also goes to network/kafka etc., so the process as such does not need any distributed file system.

Now, we do want to start distributing this processing across a few machines and make a real cluster out of it. However, I am not sure if HDFS is a hard requirement for that to happen. I am thinking about the shuffle spills, DStream/RDD persistence, and checkpoint info. Do any of these require the state to be shared via HDFS? Are there other alternatives that can be utilized if state sharing is accomplished via the file system only?

Thanks
Nikunj