The challenge I have is this. There are two streams of data. An event looks like (time, hashkey, foo1) in stream1 and (time, hashkey, foo2) in stream2. The result after joining should be (time, hashkey, foo1, foo2). The join happens on hashkey, and matching events can be up to ~30 minutes apart. The amount of data is enormous: hundreds of billions of events per month. I need not only to join the existing historical data but also to keep joining incoming data (which arrives in batches, not as a true stream).
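To make the intended semantics concrete, here is a minimal single-machine sketch in plain Python of the join I have in mind. It is not Spark or Hive code. It assumes times are epoch seconds, that the output carries the stream1 time, and that one event may match several events on the other side; `window_join` and `WINDOW_SECONDS` are names made up for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # assumed maximum gap between matching events


def window_join(stream1, stream2, window=WINDOW_SECONDS):
    """Join (time, hashkey, foo1) with (time, hashkey, foo2) on hashkey
    when the two event times differ by at most `window` seconds."""
    # Group stream2 events by hashkey.
    grouped = defaultdict(list)
    for t2, key, foo2 in stream2:
        grouped[key].append((t2, foo2))

    # Sort each group by event time and keep the times in a parallel
    # list so the matching range can be found by binary search.
    index = {}
    for key, events in grouped.items():
        events.sort(key=lambda e: e[0])
        index[key] = ([t for t, _ in events], events)

    joined = []
    for t1, key, foo1 in stream1:
        if key not in index:
            continue
        times, events = index[key]
        lo = bisect_left(times, t1 - window)
        hi = bisect_right(times, t1 + window)
        # Every stream2 event inside [t1 - window, t1 + window] matches.
        for _, foo2 in events[lo:hi]:
            joined.append((t1, key, foo1, foo2))
    return joined
```

For example, `window_join([(1000, "a", "x1")], [(2200, "a", "x2")])` pairs the two events because they are 20 minutes apart. The time used here comes only from the events, never from a wall clock, which is the behaviour I am after. At real scale the per-key grouping would presumably correspond to partitioning both inputs by hashkey.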
For now I was thinking of implementing this in MapReduce with sliding windows. I'm wondering whether Spark can actually help me with this sort of challenge. How would a join of two huge streams of historical data actually happen internally within Spark, and would it be more efficient than, say, a Hive MapReduce stream join of two big tables? I also saw that Spark Streaming has windowing support, but it seems you cannot provide your own timer. That is, I cannot have the time derived from the events themselves rather than from an actual running clock. Thanks,