Hi,

I have a setup where data from an external stream is piped into Kafka and
also written to HDFS periodically for long-term storage. Now I am trying to
build an application that will first process the HDFS files and then switch
to Kafka, continuing with the first item that was not yet in HDFS. (The
items have an increasing timestamp that I can use to find the "first item
not yet processed".) I am wondering what an elegant method to provide a
unified view of the data would be.

Currently, I am using two StreamingContexts, one after another (see the
sketch below the list):
 - start one StreamingContext A and process all data found in HDFS
(updating the largest timestamp seen so far in an accumulator), stopping
once a batch produces an RDD with 0 items in it,
 - stop that StreamingContext A,
 - start a new StreamingContext B and process the Kafka stream (filtering
out all items with a timestamp smaller than the value in the accumulator),
 - stop when the user requests it.
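
For concreteness, the current code looks roughly like the sketch below
(the HDFS path, the Kafka/ZooKeeper details, the Item type and the
parseItem helper are placeholders rather than my actual names, and
textFileStream is just a stand-in for however the files already in the
directory actually get picked up; the accumulator uses a custom
AccumulatorParam so that "add" keeps the maximum):

  import org.apache.spark.{AccumulatorParam, SparkConf}
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  case class Item(timestamp: Long, payload: String)
  def parseItem(line: String): Item = ???  // placeholder for the real parser

  // accumulator that keeps the largest value instead of a sum
  object MaxLongParam extends AccumulatorParam[Long] {
    def zero(initialValue: Long): Long = initialValue
    def addInPlace(t1: Long, t2: Long): Long = math.max(t1, t2)
  }

  val conf = new SparkConf().setAppName("HdfsThenKafka")

  // --- StreamingContext A: replay the HDFS archive ---
  val sscA = new StreamingContext(conf, Seconds(10))
  val maxTimestamp = sscA.sparkContext.accumulator(0L)(MaxLongParam)
  @volatile var caughtUp = false

  sscA.textFileStream("hdfs:///archive/items")  // placeholder path
    .map(parseItem)
    .foreachRDD { rdd =>
      if (rdd.take(1).isEmpty) {
        caughtUp = true  // empty batch => archive fully replayed
      } else {
        rdd.foreach(item => maxTimestamp += item.timestamp)  // "+=" is max here
        // ... actual per-batch processing ...
      }
    }

  sscA.start()
  while (!caughtUp) Thread.sleep(1000)
  sscA.stop(stopSparkContext = false)

  // --- StreamingContext B: continue from Kafka ---
  val sscB = new StreamingContext(sscA.sparkContext, Seconds(10))
  val cutoff = maxTimestamp.value  // largest timestamp already covered by HDFS

  KafkaUtils.createStream(sscB, "zkhost:2181", "my-group", Map("items" -> 1))
    .map { case (_, line) => parseItem(line) }
    .filter(_.timestamp > cutoff)  // drop items already covered by HDFS
    .foreachRDD { rdd =>
      // ... same per-batch processing ...
    }

  sscB.start()
  sscB.awaitTermination()  // runs until the user stops the application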

This *works* as it is now, but I was planning to add sliding windows
(based on item counts or on the timestamps in the data), and I guess
those will get unmanageably complicated once a window spans data from
both HDFS and Kafka. Therefore I would like to have a single DStream
that is fed first with the HDFS data and then with the Kafka data.
Does anyone have a suggestion on how to realize that (with as little
copying of private[spark] classes as possible)? I guess one issue is
that the Kafka input requires a receiver and therefore needs to be
treated quite differently from HDFS file input?
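
Just to illustrate what I mean: if there were a single unifiedStream
covering both sources, even a plain (processing-time) sliding window
would be a one-liner, and the count- or timestamp-based windows I have
in mind could then be built on top of that one stream instead of two:

  // hypothetical: unifiedStream is one DStream[Item] fed first from
  // HDFS, then from Kafka
  val windowed = unifiedStream.window(Seconds(60), Seconds(10))
  windowed.foreachRDD { rdd =>
    // aggregate over the whole window, no matter whether the items
    // originally came from HDFS or from Kafka
  }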

Thanks
Tobias
