Hello,

We have a Spark Streaming program that is currently running on a single
node in "local[n]" master mode. We currently give it local directories for
Spark's own state management, etc. The input streams in over the network
(Flume) and the output also goes out over the network (Kafka), so the
processing itself does not need any distributed file system.

Now we do want to start distributing this processing across a few machines
and make a real cluster out of it. However, I am not sure whether HDFS is a
hard requirement for that to happen. I am thinking specifically of shuffle
spills, DStream/RDD persistence, and checkpoint info. Do any of these
require the state to be shared via HDFS? If the state does have to be
shared through a file system, are there alternatives to HDFS that we could
use?
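For context, the relevant pieces of our setup look roughly like this (a
minimal sketch; the paths and app name are placeholders, not our actual
configuration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setMaster("local[4]")        // would become a cluster master URL
      .setAppName("StreamingJob")   // placeholder name
      // Local scratch space used for shuffle spills and block storage;
      // presumably this stays per-node even in a cluster?
      .set("spark.local.dir", "/tmp/spark-local")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Checkpoint directory for DStream metadata and state. This is the
    // part I am unsure about: once we run on multiple nodes, does this
    // have to be HDFS (or some other shared file system)?
    ssc.checkpoint("/tmp/spark-checkpoint")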

Thanks
Nikunj
