On Thu, Oct 15, 2009 at 2:20 PM, Something Something <luckyguy2...@yahoo.com> wrote:

> I have 3 HTables.... Table1, Table2 & Table3.
> I have 3 different flat files.  One contains keys for Table1, 2nd contains
> keys for Table2 & 3rd contains keys for Table3.
>
> Use case:  For every combination of these 3 keys, I need to perform some
> complex calculation and save the result in another HTable.  In other words,
> I need to calculate values for the following combos:
>
> (1,1,1) (1,1,2).......   (1,1,N) (1,2,1) (1,3,1) & so on....
>
> So I figured the best way to do this is to start a MapReduce Job for each
> of these combinations.  The MapReduce will get (Key1, Key2, Key3) as input,
> then read Table1, Table2 & Table3 with these keys and perform the
> calculations.  Is this the correct approach?  If it is, I need to pass Key1,
> Key2 & Key3 to the Mapper & Reducer.  What's the best way to do this?
>
So you need the Cartesian product of all these files. My recommendation:

Run three jobs, each of which reads one of these files and sets a flag in the
corresponding rows of the appropriate table. That way you don't need the files
at all afterwards; you just read a "flag:active" column in the tables.

Next, pick one of the tables. It doesn't really matter which one from a
logical standpoint: you could just take Table1, you could pick the one with
the most data in it, or the one with the most individual entries flagged. Use
it as the input to TableInputFormat, with a filter that only passes through
the rows that are flagged.
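Roughly, the driver could look like this (untested sketch; class names are
from the org.apache.hadoop.hbase.mapreduce API and may differ across HBase
versions, "table1" is a placeholder, and CrossProductMapper is sketched
further down). SingleColumnValueFilter with setFilterIfMissing(true) drops
the unflagged rows on the server side:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class CrossProductDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Server-side filter: only pass rows whose flag:active column is "1".
    SingleColumnValueFilter onlyFlagged = new SingleColumnValueFilter(
        Bytes.toBytes("flag"), Bytes.toBytes("active"),
        CompareOp.EQUAL, Bytes.toBytes("1"));
    onlyFlagged.setFilterIfMissing(true); // skip rows with no flag at all

    Scan scan = new Scan();
    scan.setFilter(onlyFlagged);

    Job job = new Job(conf, "cartesian-product");
    job.setJarByClass(CrossProductDriver.class);
    TableMapReduceUtil.initTableMapperJob(
        "table1", scan, CrossProductMapper.class,
        ImmutableBytesWritable.class, Put.class, job);

    // Output wiring (TableOutputFormat) is shown below, after the mapper.
  }
}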

In the mapper, open a scanner over each of the other two tables using the
same filter, which gives you two nested loops inside your map. In the
innermost loop, be sure to update a counter or call progress() so the
JobTracker doesn't time the task out.
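A rough, untested sketch of that mapper; the compute() helper is a
hypothetical stand-in for your actual calculation, and parts of the HTable
client API have been renamed in newer HBase releases:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class CrossProductMapper
    extends TableMapper<ImmutableBytesWritable, Put> {

  private HTable table2;
  private HTable table3;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    table2 = new HTable(conf, "table2");
    table3 = new HTable(conf, "table3");
  }

  @Override
  protected void map(ImmutableBytesWritable key1, Result row1, Context context)
      throws IOException, InterruptedException {
    // Outer loop: every flagged row of table2.
    ResultScanner scanner2 = table2.getScanner(flaggedScan());
    for (Result row2 : scanner2) {
      // Inner loop: every flagged row of table3.
      ResultScanner scanner3 = table3.getScanner(flaggedScan());
      for (Result row3 : scanner3) {
        Put result = compute(row1, row2, row3);
        context.write(new ImmutableBytesWritable(result.getRow()), result);
        context.progress(); // keep the JobTracker from killing a slow task
      }
      scanner3.close();
    }
    scanner2.close();
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table2.close();
    table3.close();
  }

  // Same filter the driver uses on table1: pass only flagged rows.
  private Scan flaggedScan() {
    SingleColumnValueFilter f = new SingleColumnValueFilter(
        Bytes.toBytes("flag"), Bytes.toBytes("active"),
        CompareOp.EQUAL, Bytes.toBytes("1"));
    f.setFilterIfMissing(true);
    Scan scan = new Scan();
    scan.setFilter(f);
    return scan;
  }

  // Hypothetical stand-in for the actual "complex calculation": here it
  // just concatenates the three row keys into the output key.
  private Put compute(Result r1, Result r2, Result r3) {
    byte[] outKey = Bytes.add(r1.getRow(), r2.getRow(), r3.getRow());
    Put put = new Put(outKey);
    put.add(Bytes.toBytes("result"), Bytes.toBytes("value"),
        Bytes.toBytes("TODO"));
    return put;
  }
}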

Use TableOutputFormat from that job to write to your output table.
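Continuing the driver sketch above, the output side is just two calls;
passing a null reducer to initTableReducerJob wires up TableOutputFormat, and
with zero reduces the mapper's Puts go straight to the output table
("results" is a placeholder name):

    // Writes the mapper's Puts directly into the results table; with zero
    // reduce tasks the map output goes straight to TableOutputFormat.
    TableMapReduceUtil.initTableReducerJob("results", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);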

Depending on what exactly a row key in your original input files means, the
next time through you will likely need to clear all the flags before starting
the process again.
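If you do, a simple client-side loop per table would clear them (untested
sketch; fine for modest row counts, a map-only job over each table scales
better, and Delete.deleteColumns has been renamed addColumns in newer HBase
releases):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class FlagCleaner {
  // Removes the flag:active column from every row of one table.
  public static void clearFlags(Configuration conf, String tableName)
      throws IOException {
    HTable table = new HTable(conf, tableName);
    Scan scan = new Scan();
    // Only fetch rows that actually have the flag column.
    scan.addColumn(Bytes.toBytes("flag"), Bytes.toBytes("active"));
    for (Result row : table.getScanner(scan)) {
      Delete delete = new Delete(row.getRow());
      delete.deleteColumns(Bytes.toBytes("flag"), Bytes.toBytes("active"));
      table.delete(delete);
    }
    table.close();
  }
}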

You definitely will not be starting multiple MapReduce jobs. You will have
one MapReduce job that iterates through all the possible combinations, and
your goal needs to be making sure the work can be split up enough to
parallelize well.
