It's going well enough that this is a "how should I do this in 1.0.0" rather
than a "how do I" question.

So I've got data coming in via Spark Streaming (tweets) and I want to
archive/log it all. It seems a bit wasteful to generate a new HDFS file for
every DStream batch, but I also want to guard against data loss from crashes.

I suppose what I want is to let things build up into "superbatches" over a
few minutes, and then serialize those to Parquet files, or similar? Or do I?
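For context, this is roughly what I'm imagining (an untested Scala sketch;
the paths and intervals are made up, and I'm assuming the TwitterUtils
receiver from the spark-streaming-twitter module):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TweetArchiver {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TweetArchiver")
    // 10-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(10))

    // Credentials come from the usual twitter4j.oauth.* system properties
    val tweets = TwitterUtils.createStream(ssc, None).map(_.getText)

    // "Superbatch": collect 5 minutes of tweets and write them out every
    // 5 minutes, so each write produces one larger set of files instead
    // of a new set every 10 seconds.
    tweets.window(Minutes(5), Minutes(5))
          .saveAsTextFiles("hdfs:///archive/tweets")   // example path only

    ssc.start()
    ssc.awaitTermination()
  }
}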

Do I count down the number of batches myself, or does Spark have a preferred
way of scheduling cron-like events?
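The only per-batch hook I've spotted so far is foreachRDD, so presumably I
could count batches there myself, along these lines (another rough sketch;
"tweets" is the DStream from the snippet above, and 30 ten-second batches is
just an arbitrary five-minute threshold):

// Counting batches by hand: the foreachRDD function runs once per
// 10-second batch on the driver, so 30 batches is roughly 5 minutes.
var batchCount = 0
tweets.foreachRDD { rdd =>
  batchCount += 1
  if (batchCount % 30 == 0) {
    // ... flush whatever has accumulated since the last write ...
  }
}

But that feels like I'm reinventing the scheduler.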

What's the best practice for keeping persistent data for a streaming app
across restarts? And for cleaning up on termination?
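What I've pieced together from the docs so far is checkpointing plus
StreamingContext.getOrCreate for restarts (assuming getOrCreate is available
in 1.0.0), and a graceful stop for shutdown, roughly like this (yet another
sketch; the checkpoint path is made up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/tweet-archiver"   // example path

// Build a fresh context the first time; later runs recover the DStream
// graph and pending batches from the checkpoint directory instead.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("TweetArchiver")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... set up the tweet DStream and output operations here ...
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()

// On termination: let in-flight batches finish before shutting down.
sys.addShutdownHook {
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}

ssc.awaitTermination()

Is that the intended pattern, or is there something better?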


-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers
