We are using the default partitioner. I am just about to start verifying
my result as it took quite a while to work my way through the in-obvious
issues of hand writing MapFiles, thinks like the key and value class are
extracted from the jobconf, output key/value.
Question: I looked at the HashPartitioner (which we are using) and a
key's partition is simply based on the key.hashCode() %
conf.getNumReduces().
How will I get equal keys going to different partitions - clearly I
have an understanding gap.
Thanks!
Chris Douglas wrote:
Forgive me if you already know this, but the correctness of the
map-side join is very sensitive to partitioning; if your input in
sorted but equal keys go to different partitions, your results may be
incorrect. Is your input such that the default partitioning is
sufficient? Have you verified the correctness of your results? -C
On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:
For the data joins, I let the framework do it - which means one
partition per split - so I have to chose my partition count carefully
to fill the machines.
I had an error in my initial outer join mapper, the join map code now
runs about 40x faster than the old brute force read it all shuffle &
sort.
Chris Douglas wrote:
Hi Jason-
It only seems like full outer or full inner joins are supported. I
was hoping to just do a left outer join.
Is this supported or planned?
The full inner/outer joins are examples, really. You can define your
own operations by extending o.a.h.mapred.join.JoinRecordReader or
o.a.h.mapred.join.MultiFilterRecordReader and registering your new
identifier with the parser by defining a property
"mapred.join.define.<ident>" as your class.
For a left outer join, JoinRecordReader is the correct base.
InnerJoinRecordReader and OuterJoinRecordReader should make its use
clear.
On the flip side doing the Outer Join is about 8x faster than doing
a map/reduce over our dataset.
Cool! Out of curiosity, how are you managing your splits? -C
--
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if
interested