Hi Martin,

For me the value in joining imp and click streams lies more in the last point you mentioned, enrichment, than in CTR counting. Note though that (in my case at least) the aim isn't to enrich the click event with data from the imp but the reverse.
My ad servers generate literally dozens of discrete event streams, most of which relate closely to internal workings but can usually be associated with an impression event. For me the value in joining these streams is to create much richer composite events that can then be pushed to other tasks. Those tasks can do things such as CTR counting (aggregation, as you say), but they can also enable near-real-time processing on the composite records. The latter is particularly useful for health/alert type behaviours.

The various related events are usually quite temporally adjacent, but things like clicks can trail behind, which is where the stateful aspects are useful, though a strategy to deal with stragglers is always needed.

I appreciate this is maybe very industry specific. Perhaps the useful example to draw here is a more general one: grouping related system events into a richer composite record in a way that is both more resilient and not dependent on in-memory buffering?

Regards
Garry

-----Original Message-----
From: Martin Kleppmann [mailto:[email protected]]
Sent: 31 May 2014 19:19
To: Samza dev list
Subject: Example use cases for stream/stream join

I am currently rewriting this page: http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html which includes a few examples of use cases where stateful processing is needed. Most of the examples are ok, but there's one which I don't find credible: the stream/stream join.

The example given is: you have a stream of ad impressions and a stream of ad clicks, and you want to join each click with its corresponding impression so that you can calculate the click-through rate. Unfortunately the example suffers from a few flaws:

- To calculate the CTR, you don't actually need to join individual events. You only need to count clicks and impressions (perhaps grouped by various dimensions in an OLAP cube).
  Clicks and impressions can be counted independently, so this is really an aggregation example, not a stream join example.

- You could argue that you need to join individual events because you want to include attributes of the impression in the analysis of the clicks (e.g. timestamp of impression). However, such attributes of the impression can be directly included in the click event (whatever tracks the click can remember the attributes of the impression, e.g. encoded in a URL). That would be much simpler than trying to join the streams after the fact.

Could someone enlighten me why the join of ad clicks and ad impressions is necessary? Or if not, does someone have a compelling and easy-to-understand example of stream/stream joins that I could include in the docs? I'm struggling to think of one myself, even though it must exist...

Thanks,
Martin
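
To make the composite-record idea from this thread concrete, here is a minimal sketch in Python of grouping related events by impression id and flushing stragglers after a time window. It is only an illustration under stated assumptions, not Samza code: the names `Composite`, `CompositeJoiner`, `process` and `expire` are hypothetical, and the plain dict passed as `store` merely stands in for a durable key-value store (a real job would keep this state somewhere that survives restarts rather than in process memory).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Composite:
    """All events seen so far for one impression (hypothetical record type)."""
    impression_id: str
    first_seen: float                                  # time of the first related event
    events: Dict[str, dict] = field(default_factory=dict)  # event type -> payload


class CompositeJoiner:
    """Groups related events by impression id into one composite record."""

    def __init__(self, store: dict, expected_types, window_seconds: float):
        self.store = store                  # ASSUMPTION: dict stands in for a durable store
        self.expected = set(expected_types)
        self.window = window_seconds

    def process(self, event_type: str, impression_id: str,
                payload: dict, now: float) -> Optional[Composite]:
        """Add one event; return the composite once every expected type has arrived."""
        composite = self.store.get(impression_id)
        if composite is None:
            composite = Composite(impression_id, first_seen=now)
            self.store[impression_id] = composite
        composite.events[event_type] = payload
        if self.expected.issubset(composite.events):
            del self.store[impression_id]   # complete: emit and forget
            return composite
        return None

    def expire(self, now: float) -> List[Composite]:
        """Straggler policy: emit incomplete composites older than the window."""
        expired = [c for c in self.store.values()
                   if now - c.first_seen >= self.window]
        for c in expired:
            del self.store[c.impression_id]
        return expired


# Example: an impression followed by a trailing click.
joiner = CompositeJoiner({}, ["impression", "click"], window_seconds=60)
joiner.process("impression", "imp-1", {"ad": "banner"}, now=0.0)
joined = joiner.process("click", "imp-1", {"x": 10, "y": 20}, now=5.0)
```

One design choice worth noting: an event that arrives after its composite has been emitted or expired starts a fresh, partial composite in this sketch. Whether to drop such late events, emit them on their own, or merge them downstream is exactly the straggler strategy mentioned above, and it depends on the use case.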
