I need to do a reduce-side join of two datasets. It's a many-to-many
join; that is, each dataset can contain multiple records with any given
key.
Every description of a reduce-side join I've seen involves
constructing your keys in your mapper such that records from one
dataset will be presented to the reducer before records from the other.
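(For context, the key construction being described is usually a tagged composite key. A minimal sketch, with illustrative names not from this thread, might look like:)

```java
// Sketch of a tagged composite key: the natural join key plus a dataset
// tag (0 = first dataset, 1 = second). Sorting by (key, tag) makes all
// records from dataset 0 arrive before those from dataset 1; a grouping
// comparator that looks only at the natural key puts both tags into the
// same reduce() call.
public class TaggedKey implements Comparable<TaggedKey> {
    final String naturalKey; // the join key
    final int tag;           // which dataset the record came from

    public TaggedKey(String naturalKey, int tag) {
        this.naturalKey = naturalKey;
        this.tag = tag;
    }

    // Full sort order: natural key first, then tag.
    public int compareTo(TaggedKey other) {
        int c = naturalKey.compareTo(other.naturalKey);
        return c != 0 ? c : Integer.compare(tag, other.tag);
    }

    // What a grouping comparator would do: compare the natural key only.
    public static int groupCompare(TaggedKey a, TaggedKey b) {
        return a.naturalKey.compareTo(b.naturalKey);
    }
}
```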
Hi Stuart,
It seems to me like you have a few options.
Option 1: Just use a lot of RAM. Unless you really expect many millions of
entries on both sides of the join, you might be able to get away with
buffering one side of each key in memory, despite its inefficiency.
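A minimal sketch of Option 1, using plain collections in place of the Hadoop reducer API (names are illustrative): buffer every value from the first dataset in RAM, then cross each record from the second dataset against that buffer.

```java
import java.util.*;

public class BufferingJoin {
    // values: all records for one key, with the first dataset's records
    // arriving first (as guaranteed by a tagged-key sort). isLeft marks
    // which side each record came from.
    public static List<String> joinOneKey(List<String> values, List<Boolean> isLeft) {
        List<String> leftBuffer = new ArrayList<>();
        List<String> output = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            if (isLeft.get(i)) {
                leftBuffer.add(values.get(i));    // buffer the whole left side in RAM
            } else {
                for (String l : leftBuffer) {     // cross product with each right record
                    output.add(l + "," + values.get(i));
                }
            }
        }
        return output;
    }
}
```

The obvious hazard is the one raised above: the left side of a hot key must fit in memory.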
Option 2: Use LocalDirAllocator to find some local storage and spill one
side of the join to disk.
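A sketch of that idea without the Hadoop dependency: in a real reducer you would ask LocalDirAllocator for a path, but a JVM temp file stands in for it here, and the names are illustrative.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class SpillJoin {
    // Spill the left side of one key to local disk, then re-read it once
    // per right-side record instead of holding it all in RAM.
    public static List<String> joinOneKey(List<String> left, List<String> right) throws IOException {
        // stand-in for LocalDirAllocator.getLocalPathForWrite(...)
        Path spill = Files.createTempFile("join-left", ".txt");
        Files.write(spill, left);

        List<String> output = new ArrayList<>();
        for (String r : right) {
            // re-iterate the spilled left side for each right record
            for (String l : Files.readAllLines(spill)) {
                output.add(l + "," + r);
            }
        }
        Files.delete(spill);
        return output;
    }
}
```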
One last possible trick to consider:
If you were to subclass SequenceFileRecordReader, you'd have access to its
seek method, allowing you to rewind the reducer input. You could then
implement a block hash join with something like the following pseudocode:
ahash = new HashMap<Key, Val>();
while (records from A remain) {
    fill ahash with a block of A records that fits in RAM;
    seek back and stream all B records, probing ahash and emitting matches;
}
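A fuller sketch of that block hash join, over two in-memory lists standing in for the rewindable SequenceFile input (all names illustrative): load A in blocks, and for each block re-scan all of B, which plays the role of the seek-based rewind.

```java
import java.util.*;

public class BlockHashJoin {
    // a, b: records as {key, value} pairs. blockSize: how many A records
    // fit in the hash table at once.
    public static List<String> join(List<String[]> a, List<String[]> b, int blockSize) {
        List<String> output = new ArrayList<>();
        for (int start = 0; start < a.size(); start += blockSize) {
            // build a hash table over one block of A: key -> values
            Map<String, List<String>> aHash = new HashMap<>();
            for (String[] rec : a.subList(start, Math.min(start + blockSize, a.size()))) {
                aHash.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
            }
            // "rewind" B and probe the hash table with every B record
            for (String[] rec : b) {
                for (String aVal : aHash.getOrDefault(rec[0], Collections.emptyList())) {
                    output.add(rec[0] + ":" + aVal + "," + rec[1]);
                }
            }
        }
        return output;
    }
}
```

Note the output order differs from a nested-loop join: it is grouped by A-block, which is fine for a join but matters if downstream code expects sorted output.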
I believe PIG, and I know Cascading, use a kind of 'spillable' list
that can be re-iterated over. PIG's version is a bit more
sophisticated, last I looked.
That said, if you were using either one of them, you wouldn't need to
write your own many-to-many join.
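A toy version of the 'spillable' list idea (illustrative only; the real Pig and Cascading implementations are considerably more sophisticated): keep a bounded number of elements in RAM, spill the overflow to a temp file, and allow repeated iteration over the combined contents.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class SpillableList {
    private final int maxInMemory;
    private final List<String> memory = new ArrayList<>();
    private Path spillFile;  // created lazily on first overflow

    public SpillableList(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    public void add(String item) throws IOException {
        if (memory.size() < maxInMemory) {
            memory.add(item);                // in-memory portion
        } else {
            if (spillFile == null) {
                spillFile = Files.createTempFile("spill", ".txt");
            }
            // append the overflow to local disk
            Files.write(spillFile, Collections.singletonList(item),
                        StandardOpenOption.APPEND);
        }
    }

    // Re-iterable: each call re-reads the spill file from the start.
    public List<String> snapshot() throws IOException {
        List<String> all = new ArrayList<>(memory);
        if (spillFile != null) {
            all.addAll(Files.readAllLines(spillFile));
        }
        return all;
    }
}
```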
cheers,
ckw
On May 28,
Use the map-side join stuff; if I understand your problem, it provides a good
solution, but requires getting over the learning hurdle.
Well described in chapter 8 of my book :)
On Thu, May 28, 2009 at 8:29 AM, Chris K Wensel ch...@wensel.net wrote: