Joins are easy.

Just reduce on a key composed of the fields you want to join on.  If the data
you are joining comes from disparate sources, have the map leave some kind of
hint (a tag) about which kind of record each value is.
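
For instance, the map side might look roughly like this (using the
org.apache.hadoop.mapreduce API; the class name and field layout are made up
for illustration, and a second mapper would do the same for the B source but
tag with "B"):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Tags every record from source A so the reducer can tell it apart from
  // source B records that share the same join key.
  public class SourceAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      // fields[0] is assumed to be the join key (say, an account id).
      context.write(new Text(fields[0]), new Text("A\t" + line));
    }
  }

MultipleInputs lets you attach a different mapper to each input path, so both
tagged streams end up in the same shuffle.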

The reducer will be iterating through sets of records that have the same
key.  This is similar to the result of an outer join, except that if you
are joining A and B and there are multiple records with the join key in
either A or B, you will see all of them together in the same reduce call.  In
many such cases, MR is actually more efficient than a traditional join
because you often don't need to generate the cross product of records at all.

In the reduce, you either build the composite (joined) record explicitly or
run your computation directly on the co-grouped records, treating the
grouping as a kind of virtual join.
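
A bare-bones reducer for that (continuing the hypothetical tagging scheme
above, and emitting the full pairing only because it makes the example
obvious) might be:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class JoinReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text joinKey, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> aSide = new ArrayList<String>();
      List<String> bSide = new ArrayList<String>();
      for (Text value : values) {
        // Split off the source tag that the mappers added.
        String[] tagged = value.toString().split("\t", 2);
        if ("A".equals(tagged[0])) {
          aSide.add(tagged[1]);
        } else {
          bSide.add(tagged[1]);
        }
      }
      // Build the composite records.  In practice you often aggregate here
      // instead of materializing the full cross product.
      for (String a : aSide) {
        for (String b : bSide) {
          context.write(new Text(a + "\t" + b), NullWritable.get());
        }
      }
    }
  }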

If you are doing a many-to-one join, then you often want the "one" record to
appear before the "many" records so you don't have to buffer the many until
you see the one.  This can be done by sorting on your group key plus a
source key, but grouping on just the group key.
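
The usual way to get that in Hadoop is to move the source tag into the map
output key and then split the comparators: partition and group on the join
key alone, but let the sort see the whole composite key.  A rough sketch
(again with made-up names, assuming a Text key of the form "joinKey<TAB>tag"
where tag "0" marks the "one" side and "1" the "many" side):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class JoinKeyUtils {

    // Everything before the tab is the real join key.
    private static String groupPart(String compositeKey) {
      int tab = compositeKey.indexOf('\t');
      return tab < 0 ? compositeKey : compositeKey.substring(0, tab);
    }

    // Partition on the join key only, so all tags for a key go to one reducer.
    public static class JoinPartitioner extends Partitioner<Text, Text> {
      @Override
      public int getPartition(Text key, Text value, int numPartitions) {
        return (groupPart(key.toString()).hashCode() & Integer.MAX_VALUE)
            % numPartitions;
      }
    }

    // Group on the join key only, so one reduce() call sees both sides;
    // the "one" side (tag 0) arrives first because the full composite key
    // was still used for sorting.
    public static class JoinGroupingComparator extends WritableComparator {
      public JoinGroupingComparator() {
        super(Text.class, true);
      }

      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        return groupPart(a.toString()).compareTo(groupPart(b.toString()));
      }
    }
  }

In the driver you would then wire these in with
job.setPartitionerClass(JoinKeyUtils.JoinPartitioner.class) and
job.setGroupingComparatorClass(JoinKeyUtils.JoinGroupingComparator.class).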


You should definitely look at Pig as well, since it might fit what I would
presume to be a fairly SQL-centric culture better than writing large Java
programs.  Last time I looked (a few months ago), it was definitely not
ready for us and we went in other directions.  The pace of change has been
prodigious, however, so I expect it is much better than when I last looked
hard.


On 2/22/08 10:12 AM, "Tim Wintle" <[EMAIL PROTECTED]> wrote:

> Have you seen PIG:
> http://incubator.apache.org/pig/
> 
> It generates hadoop code and is more query like, and (as far as I
> remember) includes union, join, etc.
> 
> Tim
> 
> On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
>> Hi,
>> 
>> I'm currently looking into how to better scale the performance of our
>> calculations involving large sets of financial data.  It is currently using
>> a series of Oracle SQL statements to perform the calculations.  It seems to
>> me that the MapReduce algorithm may work in this scenario.  However, I
>> believe I would need to perform some denormalization of data in order for this
>> to work.  Do I have to?  Or is there a good way to implement joins within
>> the Hadoop framework efficiently?
>> 
>> Thanks,
>> Chuck
> 
