Joins are easy.
Just reduce on a key composed of the fields you want to join on. If the data you are joining is disparate, include some kind of hint about which kind of record you have. The reducer will then be iterating through sets of records that share the same key. This is similar to the result of an outer join, except that if you are joining A and B and there are multiple records with the join key in either A or B, you will see them all in the same reduce. In many such cases, MR is actually more efficient than a traditional join because you don't necessarily want to generate the cross product of records. In the reduce, you can build your composite record, or operate on it as a virtual composite without materializing it.

If you are doing a many-to-one join, then you often want the "one" record to appear before the "many" records, to avoid having to buffer the many until you see the one. This can be done by sorting on your group key plus a source key, but grouping on just the group key.

You should definitely look at Pig as well, since it might fit what I would presume to be a fairly SQL-centric culture better than writing large Java programs. Last time I looked (a few months ago), it was definitely not ready for us and we have gone other directions. The pace of change has been prodigious, however, so I expect it is much better than when I last looked hard.

On 2/22/08 10:12 AM, "Tim Wintle" <[EMAIL PROTECTED]> wrote:

> Have you seen PIG:
> http://incubator.apache.org/pig/
>
> It generates hadoop code and is more query-like, and (as far as I
> remember) includes union, join, etc.
>
> Tim
>
> On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
>> Hi,
>>
>> I'm currently looking into how to better scale the performance of our
>> calculations involving large sets of financial data. It is currently
>> using a series of Oracle SQL statements to perform the calculations.
>> It seems to me that the MapReduce algorithm may work in this scenario.
>> However, I believe I would need to perform some denormalization of
>> data in order for this to work. Do I have to? Or is there a good way
>> to implement joins within the Hadoop framework efficiently?
>>
>> Thanks,
>> Chuck
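To make the tag-and-secondary-sort idea above concrete, here is a minimal sketch in plain Java (no Hadoop dependencies; all class, field, and sample-data names are made up for illustration). It simulates a reduce-side many-to-one join: each record carries a source tag, the sort is on (join key, source tag) so the "one" record arrives first, and grouping is on the join key alone, so the "many" side never needs to be buffered:

```java
import java.util.*;

public class ReduceSideJoinSketch {
    // A tagged record: source 0 = the "one" side (e.g. accounts),
    // source 1 = the "many" side (e.g. transactions).
    record Tagged(String joinKey, int source, String payload) {}

    public static List<String> join(List<Tagged> records) {
        // Secondary sort: order by (joinKey, source) so the "one"
        // record precedes its "many" records within each group.
        records.sort(Comparator.comparing(Tagged::joinKey)
                               .thenComparingInt(Tagged::source));

        List<String> out = new ArrayList<>();
        String currentKey = null;
        String oneSide = null;

        // Group on joinKey only: one pass, no buffering of the
        // "many" side, because the "one" record arrives first.
        for (Tagged r : records) {
            if (!r.joinKey().equals(currentKey)) {
                currentKey = r.joinKey();
                oneSide = null;
            }
            if (r.source() == 0) {
                oneSide = r.payload();   // remember the "one" record
            } else if (oneSide != null) {
                out.add(r.joinKey() + "\t" + oneSide + "\t" + r.payload());
            }
            // a "many" record with no matching "one" is dropped here
            // (inner-join behavior; emit it alone for an outer join)
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> recs = new ArrayList<>(List.of(
            new Tagged("acct1", 1, "txn:50.00"),
            new Tagged("acct1", 0, "name:Alice"),
            new Tagged("acct2", 0, "name:Bob"),
            new Tagged("acct2", 1, "txn:9.99"),
            new Tagged("acct3", 1, "txn:1.00")  // no "one" side
        ));
        for (String line : join(recs)) {
            System.out.println(line);
        }
    }
}
```

In a real Hadoop job the sort-on-composite-key-but-group-on-join-key split is done with a custom partitioner and grouping comparator rather than an in-memory sort, but the data flow through the reducer is the same as this loop.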