Hello,

We have a Spark Streaming program that currently runs on a single node in "local[n]" master mode. We give it local directories for Spark's own state management, etc. The input streams in over the network (Flume) and the output also goes out over the network (Kafka), so the process as such does not need a distributed file system.
Now we want to start distributing this processing across a few machines and make a real cluster out of it. However, I am not sure whether HDFS is a hard requirement for that to happen. I am thinking of the shuffle spills, DStream/RDD persistence, and checkpoint info. Do any of these require the state to be shared via HDFS? If the state sharing is accomplished via the file system only, are there other alternatives that can be used?

Thanks,
Nikunj
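For context, here is a minimal sketch (my own setup code, not quoted from any existing app; app name and paths are hypothetical) of where the checkpoint location is configured in a Spark Streaming program. In "local[n]" mode a local path works; once the driver and executors run on different machines, whatever path is passed here must be reachable from all of them, which is why HDFS (or another shared store) is the usual choice:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    // Master is typically supplied by spark-submit when moving off local mode.
    val conf = new SparkConf().setAppName("MyStreamingApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Local path: fine for local[n], but NOT visible to other nodes in a cluster.
    // ssc.checkpoint("/tmp/spark-checkpoints")

    // Shared storage reachable from driver and all executors (hypothetical URI):
    ssc.checkpoint("hdfs://namenode:8020/user/app/checkpoints")

    // ... define DStreams, then ssc.start(); ssc.awaitTermination()
  }
}
```

This is configuration scaffolding rather than a runnable job; the question is essentially whether the URI passed to `checkpoint` (and the directories used for shuffle spill and persistence) must live on HDFS, or whether any shared file system would do.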