I am currently rewriting this page: 
http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
which includes a few examples of use cases where stateful processing is needed. 
Most of the examples are ok, but there's one which I don't find credible: the 
stream/stream join.

The example given is: you have a stream of ad impressions and a stream of ad 
clicks, and you want to join each click with its corresponding impression so 
that you can calculate the click-through rate. Unfortunately the example 
suffers from a few flaws:

- To calculate the CTR, you don't actually need to join individual events. You 
only need to count clicks and impressions (perhaps grouped by various 
dimensions in an OLAP cube). Clicks and impressions can be counted 
independently, so this is really an aggregation example, not a stream join 
example.

- You could argue that you need to join individual events because you want to 
include attributes of the impression in the analysis of the clicks (e.g. 
timestamp of impression). However, such attributes of the impression can be 
directly included in the click event (whatever tracks the click can remember 
the attributes of the impression, e.g. encoded in an URL). That would be much 
simpler than trying to join the streams after the fact.

Could someone enlighten me why the join of ad clicks and ad impressions is 
necessary? Or if not, does someone have a compelling and easy-to-understand 
example of stream/stream joins that I could include in the docs? I'm struggling 
to think of one myself, even though it must exist...

Thanks,
Martin

Reply via email to