Re: streaming joining multiple streams

2015-02-12 Thread Tathagata Das
Sorry for the late response. With the amount of data you are planning to
join, any system would take time. However, between Hive's MapReduce joins,
Spark's basic shuffle joins, and Spark SQL's joins, the last wins hands
down. Furthermore, with the APIs of Spark and Spark Streaming, you will
have to do strictly less work to set up the infrastructure that you want to
build.
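To make that concrete, here is a minimal sketch of the Spark SQL route
(Spark 1.3+ assumed; the table and column names simply mirror the question
below and are otherwise made up, paste-into-spark-shell style):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class E1(time: Long, hashkey: String, foo1: String)
    case class E2(time: Long, hashkey: String, foo2: String)

    val sc = new SparkContext(new SparkConf().setAppName("sql-join-sketch"))
    val sqlContext = new SQLContext(sc)

    // Tiny in-memory inputs stand in for the real event streams.
    val s1 = sqlContext.createDataFrame(sc.parallelize(Seq(E1(1L, "k1", "a"))))
    val s2 = sqlContext.createDataFrame(sc.parallelize(Seq(E2(2L, "k1", "b"))))
    s1.registerTempTable("stream1")
    s2.registerTempTable("stream2")

    // Equi-join on hashkey; Spark SQL plans the shuffle/broadcast itself.
    val joined = sqlContext.sql(
      "SELECT s1.time, s1.hashkey, s1.foo1, s2.foo2 " +
      "FROM stream1 s1 JOIN stream2 s2 ON s1.hashkey = s2.hashkey")
    joined.show()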

Yes, Spark Streaming currently does not support providing your own timer,
because the logic to handle delays, etc. is pretty complex and specific to
each application. Usually that logic can be implemented on top of the
windowing solution that Spark Streaming already provides.
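For illustration, here is a minimal sketch of that approach (Spark 1.x
assumed; the record shape (hashkey, (eventTimeMillis, payload)) is an
assumption, not from the thread): window both DStreams a bit wider than the
expected skew, join on hashkey, then enforce the ~30 minute bound using the
timestamps carried in the records themselves:

    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops on Spark 1.2 and earlier
    import org.apache.spark.streaming.dstream.DStream

    def joinWithin30Min(
        stream1: DStream[(String, (Long, String))],
        stream2: DStream[(String, (Long, String))]): DStream[(String, (String, String))] = {
      // Window both sides wider than the ~30 min skew so matching events
      // that arrive far apart can still meet in the same window.
      val w1 = stream1.window(Minutes(35), Minutes(5))
      val w2 = stream2.window(Minutes(35), Minutes(5))

      // Shuffle join on hashkey, then enforce the event-time bound using
      // the timestamps inside the records (the application-level "timer").
      w1.join(w2)
        .filter { case (_, ((t1, _), (t2, _))) => math.abs(t1 - t2) <= 30L * 60 * 1000 }
        .map { case (k, ((_, foo1), (_, foo2))) => (k, (foo1, foo2)) }
    }

Because the windows overlap, a matched pair can surface more than once, so
a real job would deduplicate downstream; the point is that the delay logic
lives in the filter, not in Spark's own clock.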

TD



On Thu, Feb 5, 2015 at 7:37 AM, Zilvinas Saltys zilvinas.sal...@gmail.com
wrote:

 The challenge I have is this. There are two streams of data where an event
 might look like this in stream1: (time, hashkey, foo1) and in stream2:
 (time, hashkey, foo2).
 The result after joining should be (time, hashkey, foo1, foo2). The join
 happens on hashkey, and the time difference between matching events can be
 ~30 mins. The amount of data is enormous: hundreds of billions of events
 per month. I need not only to join the existing historical data but also to
 keep doing so for incoming data (which arrives in batches, not really
 streamed).

 For now I was thinking of implementing this with MapReduce and sliding
 windows. I'm wondering if Spark can actually help me with this sort of
 challenge? How would a join of two huge streams of historical data actually
 happen internally within Spark, and would it be more efficient than, say, a
 Hive MapReduce stream join of two big tables?

 I also saw Spark Streaming has windowing support, but it seems you cannot
 provide your own timer? As in, I cannot make the time be derived from the
 events themselves rather than having an actual clock running.

 Thanks,
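
For the batch/historical half of the question, one common pattern (a sketch
under assumed record shapes, not something prescribed in this thread) is to
derive a 30-minute bucket from each event's own timestamp and join on
(hashkey, bucket), replicating one side into neighboring buckets so that
pairs straddling a bucket boundary still meet exactly once:

    import org.apache.spark.SparkContext._ // pair-RDD ops on Spark 1.2 and earlier
    import org.apache.spark.rdd.RDD

    val bucketMs = 30L * 60 * 1000 // join window derived from event time

    // stream1 events are replicated into the previous, own, and next bucket;
    // stream2 events stay in their own bucket, so each qualifying pair
    // meets under exactly one key.
    def joinHistorical(
        s1: RDD[(Long, String, String)],  // (time, hashkey, foo1)
        s2: RDD[(Long, String, String)])  // (time, hashkey, foo2)
        : RDD[(Long, String, String, String)] = {
      val k1 = s1.flatMap { case (t, k, foo1) =>
        val b = t / bucketMs
        (b - 1 to b + 1).map(bb => ((k, bb), (t, foo1)))
      }
      val k2 = s2.map { case (t, k, foo2) => ((k, t / bucketMs), (t, foo2)) }
      k1.join(k2).collect {
        // Keep only pairs actually within the ~30 min bound.
        case ((k, _), ((t1, foo1), (t2, foo2))) if math.abs(t1 - t2) <= bucketMs =>
          (t1, k, foo1, foo2)
      }
    }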


