Hi,

I have a setting where data arrives in Kafka and is stored to HDFS from
there (maybe using Camus or Flume). I want to write a Spark Streaming app
where
 - first, all files in that HDFS directory are processed,
 - and then the stream from Kafka is processed, starting
   with the first item that has not yet been written to HDFS.
The order of the data matters, so I really need to do the HDFS
processing *first* (which, by the way, might take a while) and only
*then* start the stream processing.
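
To make this concrete, here is a rough sketch of what I have in mind
(Spark 1.x with the receiver-based KafkaUtils.createStream; the paths,
ZooKeeper quorum, group id, and topic name are just placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object HdfsThenKafka {
      // Placeholder transformation; the real logic would go here.
      def process(line: String): String = line

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("HdfsThenKafka")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Phase 1: batch-process everything already in HDFS.
        // saveAsTextFile is an action, so this blocks until the
        // batch job has finished.
        val historical = ssc.sparkContext.textFile("hdfs:///data/ingest/*")
        historical.map(process).saveAsTextFile("hdfs:///data/out/batch")

        // Phase 2: only now attach the Kafka stream. The
        // receiver-based consumer starts from whatever offset the
        // consumer group last committed in ZooKeeper.
        val stream = KafkaUtils.createStream(
          ssc, "zk1:2181", "hdfs-then-kafka", Map("mytopic" -> 1))
        stream.map { case (_, value) => process(value) }
              .saveAsTextFiles("hdfs:///data/out/stream")

        ssc.start()
        ssc.awaitTermination()
      }
    }

The part I am unsure about is the boundary: the consumer group's
committed offset would somehow have to line up with what Camus/Flume
has already written to HDFS, or I get gaps or duplicates between the
two phases.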

Does anyone have any suggestions on how to implement this? Should I
write a custom receiver or a custom input stream, or can I just use
built-in mechanisms?

I would be happy to learn about any ideas.

Thanks
Tobias
