Hi, I currently have a project to process data using MR. I have some
thoughts about it, and I'm looking for advice if anyone has any feedback.
In this project, I have a lot of event data related to email tracking
coming into HDFS. The events are email-tracking data, like
email_sent, email_open, email_bnc, link_click, etc. Our online system can give
me the raw data in the following two formats:
1) The delta data lists the events by type and by timestamp. For example:
email_sent, t1, .......
email_read, t2, .......
email_sent, t3, ......
If I choose this format, I can put each event type into its own data set
and just store them in HDFS, partitioned by time. This is the easiest for
ETL, but not very useful for consuming the data, as most business analysis
wants to link all the events into a chain, like email_sent, email_read,
email_click, email_bounce, etc.
So I want to join the data in the ETL and store it in HDFS in the form the
end user will most likely consume.
But linking the email events is very expensive, as it is hard to figure out
which email_sent a given email_read event belongs to, and most importantly,
to get the original email_sent timestamp.
Fortunately, our online system (stored in a big Cassandra cluster) can give
me the data in another format:
2) The delta data includes the whole email chain for the whole delta period.
For example:
email_sent, t1 .....
email_sent, email_read, t2......
email_sent, t3.......
email_sent, email_read, link_click, t4 .......
But here is the trade-off: even though it is a delta, it doesn't contain ONLY
the delta. In the example above, the 2nd line is an email_read event, and it
gives me the linking email_sent nicely, but that original email_sent event was
most likely already given to me in a previous delta. So I have to merge the
email_read into the original email_sent record, which already exists in HDFS,
and HDFS only supports append. To make this whole thing work, I have to
replace the original record in HDFS (a rough sketch of what that merge looks
like is below).
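Just to make that merge concrete, here is a minimal sketch of what I mean
(EmailChain and its fields are only illustrative names, not our real classes):
the chain record already sitting in HDFS absorbs the events carried by the
delta record for the same (email_address, sent timestamp) key, and the merged
record then replaces the old one when the partition is rewritten.

import java.util.ArrayList;
import java.util.List;

// Hypothetical value object: the full event chain for one sent email,
// keyed by (email_address, original email_sent timestamp).
class EmailChain {
    String emailAddress;
    long sentTimestamp;                           // timestamp of the original email_sent
    List<String> eventTypes = new ArrayList<>();  // e.g. email_sent, email_read, link_click
    List<Long> eventTimes = new ArrayList<>();

    // Merge the events carried by a delta record for the same key into the
    // chain that already exists in HDFS. Because HDFS only supports append,
    // the merged chain has to be written out again, replacing the old record
    // when the hourly partition is rewritten.
    EmailChain mergeDelta(EmailChain delta) {
        for (int i = 0; i < delta.eventTypes.size(); i++) {
            String type = delta.eventTypes.get(i);
            long time = delta.eventTimes.get(i);
            if (!contains(type, time)) {          // skip events we already stored
                eventTypes.add(type);
                eventTimes.add(time);
            }
        }
        return this;
    }

    private boolean contains(String type, long time) {
        for (int i = 0; i < eventTypes.size(); i++) {
            if (eventTypes.get(i).equals(type) && eventTimes.get(i) == time) {
                return true;
            }
        }
        return false;
    }
}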
We have about 200-300M events generated per day, so it is a challenge to make 
it right.
Here are my initial thoughts; I'm looking for any feedback and advice:
1) Ideally, the data will be stored in HDFS, partitioned hourly by the
original email_sent timestamp.
2) Within each hour, it may be a good idea to store the data as a MapFile,
using (email_address + timestamp) as the key, so an index can be built on top
of it and later lookups are fast (there is a rough sketch of such a composite
key after this list).
3) When the delta comes in (designed for daily as a first step), choose the
2nd format above as the source raw data, and use a first MR job to group the
data hourly based on the email_sent timestamp. Here is one of my questions
that I am not sure how to best address. In each reducer, I will have all the
data that changes emails originally sent in that hour, and I want to join it
back to that hour's data to do the replacing. But I don't think I can just do
this in the reducer directly, right? Ideally, at this point, I want to do a
merge join, maybe based on the timestamp. I have to save the reducer output
to some temp location in HDFS and do the join in another MR job, right?
Without the 1st MR job, I won't know how many hourly partitions will be
touched by the new incoming delta.
4) The challenging part is that the delta data will also contain a lot of new
email_sent events, so the above join really has to be a full outer join.
5) The ratio of new_email_sent to old_update is about 9:1, at least based on
my current research. Would it really make sense to get the data in both
format 1) and format 2) above? From 1), I can get all the new email_sent
events and discard the rest; then I would go to 2) to identify the part that
needs to be merged. But in this case I have to consume two big data dumps,
which sounds like a bad idea.
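For item 2) above, here is a rough sketch of the composite key I have in mind
(assuming plain Hadoop Writables; EmailTimeKey is just an illustrative name,
not an existing class). Since a MapFile keeps its entries sorted by key, both
the existing hourly partition and the sorted reducer output for that hour
could be scanned in key order and merge-joined record by record.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: (email_address, original email_sent timestamp).
public class EmailTimeKey implements WritableComparable<EmailTimeKey> {
    private String emailAddress = "";
    private long sentTimestamp;

    public EmailTimeKey() {}                  // no-arg constructor required by Hadoop

    public EmailTimeKey(String emailAddress, long sentTimestamp) {
        this.emailAddress = emailAddress;
        this.sentTimestamp = sentTimestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(emailAddress);
        out.writeLong(sentTimestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        emailAddress = in.readUTF();
        sentTimestamp = in.readLong();
    }

    // Sort by email address first, then by the original sent timestamp, so
    // records for the same address stay adjacent within an hourly partition.
    @Override
    public int compareTo(EmailTimeKey other) {
        int cmp = emailAddress.compareTo(other.emailAddress);
        return cmp != 0 ? cmp : Long.compare(sentTimestamp, other.sentTimestamp);
    }

    @Override
    public int hashCode() {
        return emailAddress.hashCode() * 31 + Long.hashCode(sentTimestamp);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EmailTimeKey)) return false;
        EmailTimeKey k = (EmailTimeKey) o;
        return emailAddress.equals(k.emailAddress) && sentTimestamp == k.sentTimestamp;
    }
}

The idea would be that the first MR job emits this key from the delta records
and routes each record to the hourly partition it belongs to, so the second
job only has to merge two sorted inputs per touched hour.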
Bottom line, I would like to know:
1) What file format should I consider for storing my data? Does a MapFile
make sense?
2) Is there any part I need to pay attention to when designing these MR jobs?
How do I make the whole thing efficient?
3) Even within the MapFile, what format should I consider for serializing the
value object? Should I use Google Protobuf, Apache Avro, or something else,
and why? (A rough sketch of what the value object might look like as an Avro
record is below, just to show what I need to serialize.)
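To be clear about what the value object roughly contains, here is an
illustrative sketch of it as an Avro generic record (the schema and field
names are only an example, not a decision for Avro over Protobuf):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class ChainValueExample {
    // Illustrative schema only: one record per original email_sent,
    // carrying the follow-up events that get merged into it.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"EmailChain\",\"fields\":["
      + "{\"name\":\"email_address\",\"type\":\"string\"},"
      + "{\"name\":\"sent_ts\",\"type\":\"long\"},"
      + "{\"name\":\"events\",\"type\":{\"type\":\"array\",\"items\":"
      + "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"event_type\",\"type\":\"string\"},"
      + "{\"name\":\"ts\",\"type\":\"long\"}]}}}]}";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // One follow-up event that was merged into the chain.
        Schema eventSchema = schema.getField("events").schema().getElementType();
        GenericRecord read = new GenericData.Record(eventSchema);
        read.put("event_type", "email_read");
        read.put("ts", 2000L);

        GenericRecord chain = new GenericData.Record(schema);
        chain.put("email_address", "user@example.com");
        chain.put("sent_ts", 1000L);
        chain.put("events", java.util.Collections.singletonList(read));

        System.out.println(chain);
    }
}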
Thanks                                    
