The challenge I have is this. There are two streams of data where an event
might look like this in stream1: (time, hashkey, foo1), and in stream2:
(time, hashkey, foo2).
The result after joining should be (time, hashkey, foo1, foo2). The join
happens on hashkey, and the time difference between matching events can be
up to ~30 minutes. The amount of data is enormous: hundreds of billions of
events per month. I need to not only join the existing historical data but
also keep doing this for incoming data (which arrives in batches, not
really streamed).
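
To make it concrete, here is roughly the batch join I have in mind: key
both sides by (hashkey, 30-minute bucket) and replicate one side into the
two neighbouring buckets, so that any pair of events within ~30 minutes of
each other lands on a common key. A minimal sketch in the RDD API, with
made-up input paths and a tab-separated record format:

    import org.apache.spark.{SparkConf, SparkContext}

    object BucketJoin {
      val windowMs = 30 * 60 * 1000L  // max gap between matching events

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("bucket-join"))

        // hypothetical inputs; each line is "time<TAB>hashkey<TAB>fooN"
        val stream1 = sc.textFile("hdfs:///data/stream1").map { line =>
          val Array(t, k, foo1) = line.split('\t'); (t.toLong, k, foo1)
        }
        val stream2 = sc.textFile("hdfs:///data/stream2").map { line =>
          val Array(t, k, foo2) = line.split('\t'); (t.toLong, k, foo2)
        }

        // key stream1 by (hashkey, bucket derived from the event time)
        val left = stream1.map { case (t, k, foo1) =>
          ((k, t / windowMs), (t, foo1))
        }

        // replicate stream2 into its own bucket plus both neighbours, so
        // any pair within windowMs of each other shares a bucket key
        val right = stream2.flatMap { case (t, k, foo2) =>
          val b = t / windowMs
          Seq(b - 1, b, b + 1).map(bb => ((k, bb), (t, foo2)))
        }

        // shuffle join on (hashkey, bucket), then enforce the exact bound
        val joined = left.join(right)
          .filter { case (_, ((t1, _), (t2, _))) => math.abs(t1 - t2) <= windowMs }
          .map { case ((k, _), ((t1, foo1), (_, foo2))) => (t1, k, foo1, foo2) }

        joined.saveAsTextFile("hdfs:///data/joined")
      }
    }

Since each stream1 event sits in exactly one bucket and the stream2 copies
sit in distinct buckets, every matching pair joins exactly once, so no
dedup pass should be needed.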

For now I was thinking of implementing this with MapReduce and sliding
windows. I'm wondering if Spark can actually help me with this sort of
challenge. How would a join of two huge streams of historic data actually
happen internally within Spark, and would it be more efficient than, say, a
Hive MapReduce stream join of two big tables?
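
From what I understand, rdd1.join(rdd2) cogroups the two sides: both get
hash-partitioned on the key and shuffled, essentially a reduce-side join as
in MapReduce. What looks attractive to me is that I could partition and
cache the big historic side once, and then join each incoming batch against
it without re-shuffling the history every time. A rough sketch (the
partition count is a placeholder, and the key/value types follow the
bucketing above):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // partition and cache the big historic side once; the key is
    // (hashkey, bucket), the value (event time, payload)
    def prepareHistory(history: RDD[((String, Long), (Long, String))]) =
      history.partitionBy(new HashPartitioner(4096)).persist()

    // each new batch joins against the cached history; since the history
    // already carries a partitioner, only the batch side gets shuffled
    def joinBatch(history: RDD[((String, Long), (Long, String))],
                  batch: RDD[((String, Long), (Long, String))]) =
      history.join(batch)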

I also saw that Spark Streaming has windowing support, but it seems you
cannot provide your own timer? As in, I cannot make the time be derived
from the events themselves rather than having an actual clock running.
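
If that's the case, my fallback is to ignore the built-in window() entirely
and derive the bucket from the event's own time field inside each
micro-batch, carrying per-bucket state across batches myself. Something
like this (assuming a DStream of (time, hashkey, payload) tuples):

    import org.apache.spark.streaming.dstream.DStream

    val windowMs = 30 * 60 * 1000L

    // key each record by a bucket computed from the event's own
    // timestamp, not the wall clock
    def keyByEventTime(events: DStream[(Long, String, String)]) =
      events.map { case (t, k, v) => ((k, t / windowMs), (t, v)) }

    // per-bucket state could then be carried across micro-batches with
    // updateStateByKey instead of the clock-driven window()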

Thanks,
