Hello,

I'm very new to Hadoop and I am trying to carry out a proof of concept for
processing some trading data. I am from a .NET background, so I am trying
to prove whether it can be done primarily using C#; therefore I am looking
at the Hadoop Streaming job (from the Hadoop examples) to call into some
C# executables.

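For reference, I am expecting to launch the job with something along these
lines (the jar path and the executable name are just placeholders for
whatever I end up with):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /trades/input \
  -output /trades/output \
  -mapper TradeMapper.exe \
  -file TradeMapper.exe
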
My problem is that I am not certain of the best way to structure my jobs to
process the data in the way I want.

I have data stored in an RDBMS in the following format:

ID  TradeID  Date        Value
---------------------------------------------
1   1        2012-01-01  12.34
2   1        2012-01-02  12.56
3   1        2012-01-03  13.78
4   2        2012-01-04  18.94
5   2        2012-05-17  19.32
6   2        2012-05-18  19.63
7   3        2012-05-19  17.32

What I want to do is pass all the Dates and Values for a given TradeID into a
mathematical function that will spit out the same set of Dates but with all
the Values recalculated. I hope that makes sense, e.g.

Date        Value
---------------------------
2012-01-01  12.34
2012-01-02  12.56
2012-01-03  13.78

will have the mathematical function applied and spit out

Date        Value
---------------------------
2012-01-01  28.74
2012-01-02  31.29
2012-01-03  29.93

I am not exactly sure how to achieve this using Hadoop Streaming, but my
thoughts so far are...

   1. Use Sqoop to take the data out of the RDBMS and into HDFS, split
   by TradeID - will this guarantee that all the data points for a given
   TradeID are processed by the same Map task? (I have sketched the command
   below this list.)
   2. Write the Map task as a C# executable that streams data in from
   stdin in the format (ID, TradeID, Date, Value) - see the rough sketch
   after the list
   3. Gather all the data points for a given TradeID together into an array
   (or other data structure)
   4. Pass the array into the mathematical function
   5. Get the results back as another array
   6. Stream the results back out in the format (TradeID, Date, ResultValue)
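
For step 1, I am imagining a Sqoop import along these lines (the connection
string, credentials and paths are just placeholders):

sqoop import \
  --connect jdbc:sqlserver://myserver/TradesDb \
  --username myuser --password mypass \
  --table Trades \
  --split-by TradeID \
  --target-dir /trades/input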

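For steps 2-6, here is a rough, untested sketch of the C# executable. The
ApplyModel function is a made-up stand-in for my real mathematical function,
and the code assumes that all the lines for a given TradeID arrive
consecutively - which is exactly the part I am unsure Hadoop will guarantee:

using System;
using System.Collections.Generic;

class TradeMapper
{
    static void Main()
    {
        string currentTradeId = null;
        var dates = new List<string>();
        var values = new List<double>();

        string line;
        while ((line = Console.ReadLine()) != null)
        {
            // Step 2: read tab-separated (ID, TradeID, Date, Value) from stdin.
            var fields = line.Split('\t');
            if (fields.Length < 4) continue; // skip malformed lines

            string tradeId = fields[1];

            // Step 3: gather the points for the current TradeID; when the
            // TradeID changes, process the completed group.
            if (currentTradeId != null && tradeId != currentTradeId)
                Flush(currentTradeId, dates, values);

            currentTradeId = tradeId;
            dates.Add(fields[2]);
            values.Add(double.Parse(fields[3]));
        }

        if (currentTradeId != null)
            Flush(currentTradeId, dates, values);
    }

    static void Flush(string tradeId, List<string> dates, List<double> values)
    {
        // Steps 4 and 5: pass the arrays to the function, get results back.
        double[] results = ApplyModel(dates.ToArray(), values.ToArray());

        // Step 6: stream the results back out as (TradeID, Date, ResultValue).
        for (int i = 0; i < dates.Count; i++)
            Console.WriteLine("{0}\t{1}\t{2}", tradeId, dates[i], results[i]);

        dates.Clear();
        values.Clear();
    }

    static double[] ApplyModel(string[] dates, double[] values)
    {
        // Made-up placeholder; the real function recalculates every value.
        return values;
    }
}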
I will have around 500,000 Trade IDs, with up to 3,000 data points each, so
I am hoping that the data/processing will be distributed appropriately by
Hadoop.

Now, this seems a little long-winded, but is this the best way of doing
it, given the constraint of having to use C# for writing my tasks? In
the example above I do not have a Reduce job at all. Is that right in my
scenario?

Thanks for any help you can give and apologies if I am asking stupid
questions here!

Kind Regards,

Tom
