Ahh yes, it is absolutely critical that the partitioning in all of the
sets be the same :)
We are currently assuming that:
1) Default Partitioner
2) Same Reduce Count
3) Text keys
will guarantee that, but as I mentioned above, we are assuming ;)
Testing is just starting.
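
To make the three assumptions above concrete, here is a minimal sketch of why they matter. It assumes the default partitioner behaves like a hash partitioner, that is, it clears the sign bit of the key's hash and takes the modulus by the reduce count. The class PartitionSketch and the keyHash stand-in are hypothetical names for illustration only, not Hadoop code; the real Text hash may differ in detail.

```java
import java.nio.charset.StandardCharsets;

final class PartitionSketch {
    // Hypothetical stand-in for a byte-based key hash (assumption: any
    // deterministic hash over the key bytes is enough for this illustration).
    static int keyHash(String key) {
        int hash = 1;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        return hash;
    }

    // Hash-partitioner formula: clear the sign bit, then take the modulus
    // by the number of reduce tasks.
    static int partition(String key, int numReduceTasks) {
        return (keyHash(key) & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key with the same reduce count always lands in the same
        // partition; changing the reduce count may move it.
        for (String k : new String[] {"alpha", "beta", "gamma"}) {
            System.out.println(k + ": " + partition(k, 4) + " (of 4), "
                    + partition(k, 5) + " (of 5)");
        }
    }
}
```

If two datasets are written with the same key type, the same partition function and the same reduce count, equal keys get the same partition number in both. Change any one of the three and that is no longer guaranteed.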
Chris Douglas wrote:
Sorry, I meant splits (partitions of input data). If you have n
datasets and m splits per dataset, m_i must contain the same keys for
all n. So if you're joining two datasets A and B sharing a key k, if
split i from A contains any instances of k, (a) split i from A must
contain all instance
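
The split condition described above can be checked mechanically before trusting a join. The following is only a sketch under simple assumptions: each dataset is modelled as a list of splits, each split as a list of its keys, and SplitAlignmentCheck is a hypothetical helper, not part of Hadoop.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SplitAlignmentCheck {
    // datasets.get(d).get(i) holds the keys found in split i of dataset d.
    // Returns true when every key is confined to one split index, and that
    // index is the same in every dataset that contains the key.
    static boolean aligned(List<List<List<String>>> datasets) {
        Map<String, Integer> splitOfKey = new HashMap<>();
        for (List<List<String>> splits : datasets) {
            for (int i = 0; i < splits.size(); i++) {
                for (String key : splits.get(i)) {
                    Integer seen = splitOfKey.putIfAbsent(key, i);
                    if (seen != null && seen != i) {
                        return false; // key found under two different split indices
                    }
                }
            }
        }
        return true;
    }
}
```

For example, A = [[k1, k1], [k2]] and B = [[k1], [k2, k3]] pass the check, while swapping the two splits of B fails it because k1 and k2 would then sit at different indices in A and B.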
We are using the default partitioner. I am just about to start verifying
my results, as it took quite a while to work my way through the non-obvious
issues of hand-writing MapFiles, things like the key and value classes being
extracted from the jobconf (output key/value).
Question: I looked at the Has
Forgive me if you already know this, but the correctness of the map-side
join is very sensitive to partitioning; if your input is sorted
but equal keys go to different partitions, your results may be
incorrect. Is your input such that the default partitioning is
sufficient? Have you verifie
For the data joins, I let the framework do it - which means one
partition per split - so I have to choose my partition count carefully to
fill the machines.
I had an error in my initial outer join mapper; the join map code now
runs about 40x faster than the old brute-force read it all shuffle &
Hi Jason-
It only seems like full outer or full inner joins are supported. I
was hoping to just do a left outer join.
Is this supported or planned?
The full inner/outer joins are examples, really. You can define your
own operations by extending o.a.h.mapred.join.JoinRecordReader or
o.a
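
As a rough illustration of what "define your own operation" could mean for a left outer join: the difference between the join types comes down to which per-key tuples are accepted. The sketch below models only that acceptance rule in plain Java. JoinPredicateSketch and the boolean presence array are hypothetical; the assumption that a JoinRecordReader subclass would apply such a rule in an overridden combine-style method should be checked against the javadoc.

```java
final class JoinPredicateSketch {
    // present[i] is true when source i contributed a value for the current key.

    // Inner join: emit only when every source has the key.
    static boolean inner(boolean[] present) {
        for (boolean p : present) {
            if (!p) return false;
        }
        return present.length > 0;
    }

    // Full outer join: emit when at least one source has the key.
    static boolean outer(boolean[] present) {
        for (boolean p : present) {
            if (p) return true;
        }
        return false;
    }

    // Left outer join: emit whenever the first (left) source has the key,
    // regardless of whether the other sources do.
    static boolean leftOuter(boolean[] present) {
        return present.length > 0 && present[0];
    }
}
```

With two sources, a key present only on the left is accepted by leftOuter and outer but not inner; a key present only on the right is accepted by outer alone.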
It only seems like full outer or full inner joins are supported. I was
hoping to just do a left outer join.
Is this supported or planned?
On the flip side, doing the outer join is about 8x faster than doing a
map/reduce over our dataset.
Thanks
--
Jason Venner
Attributor - Program the Web