Re: Is HDFS required for Spark streaming?

2015-09-09 Thread Tathagata Das
Actually, I think it won't work. If you are using an operation that
requires RDD checkpointing, then if the checkpoint files cannot be read
(because the executor that wrote them failed), any subsequent operation that
needs that state data cannot continue, so all subsequent batches will fail.

You could reduce the chances of the state data being lost by replicating
the state RDDs. Set the state DStream's persistence level to
StorageLevel.MEMORY_ONLY_SER_2.
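
As a hedged sketch of what that could look like in code (Scala, DStream API; the source, window sizes, app name, and checkpoint path below are illustrative assumptions, not taken from this thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowedCounts")
val ssc  = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///user/app/checkpoints")          // illustrative path

// reduceByKeyAndWindow with an inverse function is one of the operations
// that requires RDD checkpointing to be enabled.
val counts = ssc.socketTextStream("source-host", 9999)  // illustrative source
  .flatMap(_.split(" "))
  .map(w => (w, 1L))
  .reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(10))

// Replicate the state RDDs to two executors so a single executor failure
// does not destroy the only in-memory copy of the window state.
counts.persist(StorageLevel.MEMORY_ONLY_SER_2)

counts.print()
ssc.start()
ssc.awaitTermination()
```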



Re: Is HDFS required for Spark streaming?

2015-09-09 Thread N B
Thanks Cody and TD.

If we do run with local directories, I suppose the checkpoint operation
will write each partition of an RDD into that node's own local dirs (of
course). So what's the worst that can happen in the case of a node failure?
Will the streaming batches continue to be processed (i.e., does the lost
checkpointed data get recovered or recreated?), or will the entire
application start seeing errors from that point onwards?

Thanks
Nikunj




Re: Is HDFS required for Spark streaming?

2015-09-08 Thread Tathagata Das
You can use local directories in that case, but it is not recommended and
not a well-tested code path (so I have no idea what can happen).
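
For reference, a minimal sketch of the two configurations being compared (Scala; app name and paths are illustrative assumptions). With a local checkpoint directory, each executor's checkpoint files exist only on its own disk, which is why recovery through this path is considered untested:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("App"), Seconds(10))

// Possible but not recommended: checkpoint to a node-local directory.
ssc.checkpoint("file:///tmp/spark-streaming-checkpoints")

// Recommended for fault tolerance: checkpoint to a shared, fault-tolerant
// filesystem such as HDFS, so any node can read the files back.
// ssc.checkpoint("hdfs://namenode:8020/user/app/checkpoints")
```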



Re: Is HDFS required for Spark streaming?

2015-09-08 Thread Cody Koeninger
Yes, local directories will be sufficient.



Re: Is HDFS required for Spark streaming?

2015-09-05 Thread N B
Hi TD,

Thanks!

So our application does turn on checkpointing, but we do not recover upon
application restart (we just blow the checkpoint directory away first and
re-create the StreamingContext), as we don't have a real need for that type
of recovery. However, because the application does reduceByKeyAndWindow
operations, checkpointing has to be turned on. Do you think this scenario
will also only work with HDFS, or will having local directories suffice?

Thanks
Nikunj
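
A hedged sketch of the start-up pattern described above (Scala; the checkpoint path, app name, batch interval, and the `createContext` helper are illustrative assumptions, not from this thread):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/app/checkpoints"   // illustrative path

// Blow the checkpoint directory away first, then build a fresh context:
// no recovery is attempted, and checkpointing is enabled only because
// reduceByKeyAndWindow (with an inverse function) requires it.
val fs = FileSystem.get(new URI(checkpointDir), new Configuration())
fs.delete(new Path(checkpointDir), true)

val ssc = new StreamingContext(new SparkConf().setAppName("App"), Seconds(10))
ssc.checkpoint(checkpointDir)

// The driver-recovery alternative (not used here) would instead be:
// val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
```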





Re: Is HDFS required for Spark streaming?

2015-09-04 Thread Tathagata Das
Shuffle spills will use local disk; HDFS is not needed.
Spark and Spark Streaming checkpoint info WILL NEED HDFS for
fault tolerance, so that that data can be recovered even if Spark cluster
nodes go down.

TD
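
To make the split concrete, a minimal sketch (Scala; app name and paths are illustrative assumptions): shuffle spills go to whatever `spark.local.dir` points at on each node, while the streaming checkpoint directory is the piece that should live on a fault-tolerant, shared filesystem such as HDFS:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ClusteredStreamingApp")
  // Shuffle spills and other scratch data: per-node local disk is fine.
  .set("spark.local.dir", "/mnt/spark-tmp")

val ssc = new StreamingContext(conf, Seconds(10))

// Checkpoint data must survive the loss of a node, so it goes to HDFS.
ssc.checkpoint("hdfs://namenode:8020/user/app/streaming-checkpoints")
```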



Is HDFS required for Spark streaming?

2015-09-04 Thread N B
Hello,

We have a Spark Streaming program that is currently running on a single
node in "local[n]" master mode. We currently give it local directories for
Spark's own state management, etc. The input streams in from the network/Flume
and the output also goes to the network/Kafka, so the process as such does not
need any distributed file system.

Now, we do want to start distributing this processing across a few machines
and make a real cluster out of it. However, I am not sure if HDFS is a hard
requirement for that to happen. I am thinking about the shuffle spills,
DStream/RDD persistence, and checkpoint info. Do any of these require the
state to be shared via HDFS? Are there other alternatives that can be
utilized if state sharing is accomplished via the file system only?

Thanks
Nikunj
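
For illustration, a minimal sketch of the single-node setup described above (Scala; the core count, app name, batch interval, and directory are assumptions for illustration only):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[4]")                        // "local[n]" master mode
  .setAppName("SingleNodeStreamingApp")
  // Local directories for Spark's own state management (shuffle spills etc.).
  .set("spark.local.dir", "/data/spark-tmp")

val ssc = new StreamingContext(conf, Seconds(5))
// Input arrives from the network/Flume and output goes to the network/Kafka,
// so the job itself touches no distributed filesystem.
```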