Hi, I have a setup where data from an external stream is piped into Kafka and also periodically written to HDFS for long-term storage. Now I am trying to build an application that will first process the HDFS files and then switch over to Kafka, continuing with the first item that has not yet made it to HDFS. (The items carry an increasing timestamp that I can use to find the "first item not yet processed".) I am wondering what would be an elegant way to provide a unified view of the data.
Currently, I am using two StreamingContexts one after another:
- start StreamingContext A and process all data found in HDFS (tracking the largest timestamp seen so far in an accumulator), stopping once a batch RDD with 0 items shows up,
- stop StreamingContext A,
- start a new StreamingContext B and process the Kafka stream (filtering out all items with a timestamp smaller than the value in the accumulator),
- stop when the user requests it.

This *works* as it is now (see the sketch in the P.S. below), but I am planning to add sliding windows (based on item counts or on the timestamps in the data), and I suspect that will get unmanageably complicated once a window spans data from both HDFS and Kafka. Therefore I would like to have a single DStream that is fed first with the HDFS data and then with the Kafka data. Does anyone have a suggestion on how to realize that (with as little copying of private[spark] classes as possible)? I guess one issue is that the Kafka processing requires a receiver and therefore needs to be treated quite differently than HDFS?

Thanks
Tobias
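P.S.: For reference, here is roughly what the two-context version looks like, written against the Spark 1.x receiver-based Kafka API. This is only a sketch: the HDFS path, ZooKeeper address, topic/group names, and the assumption that each record starts with a comma-separated timestamp are made up, and the real per-item processing is elided.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object HdfsThenKafka {
  // An accumulator that keeps the maximum value instead of a sum.
  object MaxLongParam extends AccumulatorParam[Long] {
    def zero(initial: Long): Long = Long.MinValue
    def addInPlace(a: Long, b: Long): Long = math.max(a, b)
  }

  // Made-up record layout: first comma-separated field is the timestamp.
  def timestampOf(line: String): Long = line.split(",")(0).toLong

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsThenKafka"))
    val maxTs = sc.accumulator(Long.MinValue)(MaxLongParam)
    @volatile var hdfsDone = false

    // Phase 1: drain HDFS; newFilesOnly = false also picks up pre-existing files.
    val sscA = new StreamingContext(sc, Seconds(10))
    sscA.fileStream[LongWritable, Text, TextInputFormat](
        "hdfs:///data/archive", (_: Path) => true, newFilesOnly = false)
      .map(_._2.toString)
      .foreachRDD { rdd =>
        if (rdd.take(1).isEmpty) hdfsDone = true // empty batch: backlog is drained
        else rdd.foreach { line =>
          maxTs += timestampOf(line) // plus the real processing
        }
      }
    sscA.start()
    while (!hdfsDone) Thread.sleep(1000) // stop from outside the batch callback
    sscA.stop(stopSparkContext = false) // keep the SparkContext for phase 2

    // Phase 2: switch to Kafka, dropping everything already seen in HDFS
    // (assuming strictly increasing timestamps, so "> cutoff" avoids duplicates).
    val cutoff = maxTs.value
    val sscB = new StreamingContext(sc, Seconds(10))
    KafkaUtils.createStream(sscB, "zk:2181", "my-group", Map("my-topic" -> 1))
      .map(_._2) // drop the Kafka message key
      .filter(line => timestampOf(line) > cutoff)
      .foreachRDD(rdd => rdd.foreach(println)) // placeholder for the real processing
    sscB.start()
    sscB.awaitTermination() // runs until the user stops the application
  }
}

Note that fileStream with newFilesOnly = false is what picks up the files that were already in HDFS when the job starts; plain textFileStream would only see newly arriving files.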