Hi
I have a question regarding design trade-offs and best practices. I'm working
on a real-time analytics system. For simplicity, I have data with timestamps
(the key) and counts (the value). I use DStreams for the real-time aspect.
Tuples with the same timestamp can be spread across several RDDs, so I
aggregate each RDD by timestamp and increment counters in Cassandra; this
gives correct aggregation counts based on the data timestamp.
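For concreteness, the write path looks roughly like the sketch below. The
keyspace/table names and the per-partition connection handling are simplified
placeholders, and the counter schema is just an assumption along the lines of
CREATE TABLE analytics.counts (ts bigint PRIMARY KEY, total counter):

    import com.datastax.driver.core.Cluster
    import org.apache.spark.streaming.dstream.DStream

    // Aggregate each batch by tuple timestamp, then bump Cassandra counters.
    def saveCounts(stream: DStream[(Long, Long)]): Unit =
      stream.foreachRDD { rdd =>
        // Collapse duplicates within this RDD so each timestamp is
        // incremented once per batch.
        rdd.reduceByKey(_ + _).foreachPartition { part =>
          // One connection per partition for simplicity; a pooled/shared
          // session would be better in production.
          val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
          val session = cluster.connect("analytics")
          val stmt = session.prepare(
            "UPDATE counts SET total = total + ? WHERE ts = ?")
          part.foreach { case (ts, count) =>
            session.execute(stmt.bind(Long.box(count), Long.box(ts)))
          }
          session.close()
          cluster.close()
        }
      }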
At the same time as the tuple aggregations are saved into Cassandra, I also
show the aggregations on a chart and pass the data through some more
complicated math formulas (they output DStreams) which involve
updateStateByKey. These other operations are similar to a moving average in
that if data arrives late you have to recalculate all moving averages starting
from the timestamp of the last delayed tuple; such is the nature of these
calculations. The calculations are not saved in the db but are recalculated
every time data is loaded from the db.
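The actual formulas are more involved, but the stateful piece boils down to
updateStateByKey; here is a simplified stand-in with a running sum in place of
the real math (note it requires ssc.checkpoint(...) on the StreamingContext):

    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    // Simplified stand-in for the stateful formulas: carry a running total
    // per timestamp key across batches.
    def runningTotals(stream: DStream[(Long, Long)]): DStream[(Long, Long)] =
      stream.updateStateByKey[Long] { (newCounts: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newCounts.sum)
      }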

Now, in real time I do things on a best-effort basis, but the database will
always have correct aggregations (even if tuples come in late for some early
timestamp, Cassandra will simply increment the counter with the amount from
the late tuple).
In real time, when tuples with the same timestamp belong to several RDDs, I
don't aggregate by tuple timestamp (because that would mean reapplying the
math formulas from the timestamp of the last late tuple, which is too much
overhead); instead I aggregate by RDD time, i.e. the system time at which the
RDD is created. This is good enough for real time.
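In code, this best-effort path is roughly the following, keying by the batch
time that Spark passes to transform instead of by the tuple timestamp (types
simplified as before):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    // Best-effort path: re-key every tuple by the time of the batch (RDD)
    // it arrived in, then aggregate within that batch.
    def byBatchTime(stream: DStream[(Long, Long)]): DStream[(Long, Long)] =
      stream.transform { (rdd: RDD[(Long, Long)], batchTime: Time) =>
        rdd.map { case (_, count) => (batchTime.milliseconds, count) }
           .reduceByKey(_ + _)
      }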

So now you can see that the db (the source of truth) is different from the
real-time streaming results (best effort).

My questions:
1. From your experience, is the design I just described appropriate?
2. I'm curious how others have solved the problem of reconciling differences
between their real-time processing and batch mode. I think I read on the
mailing list (several months ago) that someone redoes the aggregation step an
hour after the data is received (i.e. the aggregation DStream job always runs
an hour behind, so late tuples have time to propagate to the db).
3. If the source of data fails and is restarted later, my design will produce
duplicates unless the tuples already in the database are deleted for the
timestamps covered by the data I am re-streaming. Is there a better way to
avoid duplicates when running the same job twice, or as part of a bigger job
(idempotency)? (See the sketch below for the only alternative I've come up
with.)
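For reference, the only idempotent alternative I've thought of (assuming the
source can attach a stable per-tuple id, which may not hold for my data) is to
drop the counter and upsert per-tuple rows; since Cassandra inserts are
upserts, replaying a tuple just overwrites its own row:

    import com.datastax.driver.core.Cluster
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical idempotent variant over a plain (non-counter) table:
    //   CREATE TABLE analytics.events (
    //     ts bigint, event_id text, count bigint,
    //     PRIMARY KEY (ts, event_id));
    // A replayed (ts, event_id) tuple overwrites the same row instead of
    // incrementing a counter twice; reads then aggregate per timestamp
    // (client-side or in a batch job).
    def saveEvents(stream: DStream[(Long, String, Long)]): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { part =>
          val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
          val session = cluster.connect("analytics")
          val stmt = session.prepare(
            "INSERT INTO events (ts, event_id, count) VALUES (?, ?, ?)")
          part.foreach { case (ts, id, count) =>
            session.execute(stmt.bind(Long.box(ts), id, Long.box(count)))
          }
          session.close()
          cluster.close()
        }
      }

The obvious downside is that aggregation moves out of the write path, which is
exactly what the counter was buying me, so I'd love to hear something better.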


Thanks
-Adrian
