Thanks Cody and TD.

If we do run with local directories, I suppose the checkpoint operation
will write the various partitions of an RDD into their own local dirs (of
course). So what's the worst that can happen in case of a node failure?
Will the streaming batches continue to process (i.e., does the lost
checkpointed data get recovered or recreated?), or will the entire
application start seeing errors from that point onwards?
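
To make the question concrete: the checkpointed state in our case comes
from a windowed transformation roughly like the sketch below (Scala; the
app name, paths, durations and the socket source are illustrative
stand-ins, not our actual code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("StreamingApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Checkpointing has to be on because the inverse-function form of
    // reduceByKeyAndWindow keeps windowed state across batches.
    ssc.checkpoint("/tmp/spark-checkpoint") // a plain local directory today

    // Stand-in for our real network/flume input.
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines
      .flatMap(_.split(" "))
      .map(w => (w, 1L))
      .reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b,  // values entering the window
        (a: Long, b: Long) => a - b,  // values leaving the window
        Seconds(300), Seconds(10))

    counts.print() // the real job publishes to kafka instead
    ssc.start()
    ssc.awaitTermination()
  }
}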

Thanks
Nikunj


On Tue, Sep 8, 2015 at 11:54 AM, Tathagata Das <t...@databricks.com> wrote:

> You can use local directories in that case, but it is not recommended and
> not a well-tested code path (so I have no idea what can happen).
>
> On Tue, Sep 8, 2015 at 6:59 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Yes, local directories will be sufficient
>>
>> On Sat, Sep 5, 2015 at 10:44 AM, N B <nb.nos...@gmail.com> wrote:
>>
>>> Hi TD,
>>>
>>> Thanks!
>>>
>>> So our application does turn on checkpointing, but we do not recover upon
>>> application restart (we just blow the checkpoint directory away first and
>>> re-create the StreamingContext), as we don't have a real need for that type
>>> of recovery. However, because the application does reduceByKeyAndWindow
>>> operations, checkpointing has to be turned on. Do you think this scenario
>>> will also only work with HDFS, or would local directories suffice?
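>>>
>>> For concreteness, our restart path does roughly the following (a minimal
>>> sketch; the checkpoint path and batch/app settings are illustrative):
>>>
>>> import java.io.File
>>> import org.apache.commons.io.FileUtils
>>> import org.apache.spark.SparkConf
>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>
>>> val checkpointDir = "/tmp/spark-checkpoint"
>>>
>>> // Wipe any previous checkpoint data and build a fresh context on every
>>> // start, instead of recovering via StreamingContext.getOrCreate().
>>> FileUtils.deleteDirectory(new File(checkpointDir))
>>>
>>> val conf = new SparkConf().setMaster("local[4]").setAppName("StreamingApp")
>>> val ssc = new StreamingContext(conf, Seconds(10))
>>> ssc.checkpoint(checkpointDir)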
>>>
>>> Thanks
>>> Nikunj
>>>
>>>
>>>
>>> On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das <t...@databricks.com>
>>> wrote:
>>>
>>>> Shuffle spills will use local disk; HDFS is not needed.
>>>> Spark and Spark Streaming checkpoint info WILL NEED HDFS for
>>>> fault-tolerance, so that it can be recovered even if the Spark cluster
>>>> nodes go down.
>>>>
>>>> TD
>>>>
>>>> On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nos...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We have a Spark Streaming program that is currently running on a
>>>>> single node in "local[n]" master mode. We currently give it local
>>>>> directories for Spark's own state management etc. The input streams in
>>>>> from the network/flume and the output also goes to the network/kafka
>>>>> etc., so the process as such does not need any distributed file system.
>>>>>
>>>>> Now, we do want to start distributing this processing across a few
>>>>> machines and make a real cluster out of it. However, I am not sure if HDFS
>>>>> is a hard requirement for that to happen. I am thinking about the shuffle
>>>>> spills, DStream/RDD persistence, and checkpoint info. Do any of these
>>>>> require the state to be shared via HDFS? Are there other alternatives that
>>>>> can be used if state sharing is accomplished via the file system only?
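>>>>>
>>>>> For context, the relevant parts of today's single-node configuration
>>>>> look roughly like this (a sketch; the paths and thread count are
>>>>> illustrative):
>>>>>
>>>>> import org.apache.spark.SparkConf
>>>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>>>
>>>>> val conf = new SparkConf()
>>>>>   .setMaster("local[4]")                      // single node, n worker threads
>>>>>   .setAppName("StreamingApp")
>>>>>   .set("spark.local.dir", "/data/spark-tmp")  // local dir for shuffle spills etc.
>>>>> val ssc = new StreamingContext(conf, Seconds(10))
>>>>> ssc.checkpoint("/data/spark-checkpoint")      // also just a local path today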
>>>>>
>>>>> Thanks
>>>>> Nikunj
>>>>>
>>>>>
>>>>
>>>
>>
>
