It's going well enough that this is a "how should I do this in 1.0.0" rather than a "how do I do this" question.
So I've got data coming in via Spark Streaming (from Twitter) and I want to archive/log all of it. It seems wasteful to generate a new HDFS file for every DStream batch, but I also want to guard against data loss from crashes. I suppose what I want is to let things build up into "superbatches" over a few minutes, and then serialize those to Parquet files, or similar? Or do I? Do I count down the number of batches myself, or does Spark have a preferred way of scheduling cron-like events?

What's the best practice for keeping persistent data for a streaming app across restarts? And for cleaning up on termination?

-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers
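
P.S. To make the question concrete, this is roughly the shape of thing I'm imagining: small micro-batches for liveness, but windowing the stream into five-minute "superbatches" and writing each one out as a single set of files. It's only a sketch, not a working app: the checkpoint and archive paths are made up, and I've used plain saveAsTextFile where I'd presumably want Parquet in practice.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
  import org.apache.spark.streaming.twitter.TwitterUtils

  object TweetArchiver {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("TweetArchiver")

      // Small micro-batches for low latency...
      val ssc = new StreamingContext(conf, Seconds(10))

      // Checkpoint so a crash doesn't lose windowed state (path is made up)
      ssc.checkpoint("hdfs:///checkpoints/tweet-archiver")

      // Twitter credentials come from the usual twitter4j system properties
      val tweets = TwitterUtils.createStream(ssc, None)
      val texts  = tweets.map(_.getText)

      // ...but only write out a "superbatch" once every five minutes
      texts.window(Minutes(5), Minutes(5)).foreachRDD { (rdd, time) =>
        if (rdd.take(1).nonEmpty) {
          // One directory per superbatch; coalesce so we don't get a tiny file per partition
          rdd.coalesce(1).saveAsTextFile(s"hdfs:///archive/tweets-${time.milliseconds}")
        }
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }

Is something like that reasonable, or is there a better-blessed mechanism for this (recreating the context from the checkpoint directory on restart, shutting down cleanly, etc.) that I should be using instead?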