On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:

> Hi David
>
> An explain extended would give you the exact pointer.
>
> From my understanding, this is how it could work.
>
> You have two tables then two different map reduce job would be
> processing those. Based on the join keys, combination of corresponding
> columns would be chosen as key from mapper1 and mapper2. So if the
> combination of columns having the same value those records from two
> set of mappers would go into the same reducer.
>
> On the reducer if there is a corresponding value for a key from table
> 1 to  table 2/mapper 2 that value would be populated. If no val for
> mapper 2 then those columns from table 2 are made null.
>
> If there is a key-value just from table 2/mapper 2 and no
> corresponding value from mapper 1. That value is just discarded.

Hi Bejoy,

Thanks! So schematically, something like this, right?

mapper1 (bigger table):
K1-A, V1A
K2-A, V2A
K3-A, V3A

mapper2 (joined, smaller table):
K1-B, V1B

reducer1:
K1-A, V1A 
K1-B, V1B

returns:
K1, V1A, V1B etc

reducer2:
K2-A, V2A
*no* K2-B, V so: K2-B, NULL is created, same for next row.
K3-A, V3A

returns:
K2, V2A, NULL etc
K3, V3A, NULL etc

I still don't understand why my reducer2 (and only this one, which
apparently gets all the keys for which we don't have a row on table B)
would become overloaded. Am I completely misunderstanding the whole
thing?

David

> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
>
> -----Original Message-----
> From: "David Morel" <dmore...@gmail.com>
> Date: Thu, 24 Jan 2013 18:03:40
> To: user@hive.apache.org<user@hive.apache.org>
> Reply-To: user@hive.apache.org
> Subject: An explanation of LEFT OUTER JOIN and NULL values
>
> Hi!
>
> After hitting the "curse of the last reducer" many times on LEFT OUTER
> JOIN queries, and trying to think about it, I came to the conclusion
> there's something I am missing regarding how keys are handled in
> mapred jobs.
>
> The problem shows when I have table A containing billions of rows with
> distinctive keys, that I need to join to table B that has a much lower
> number of rows.
>
> I need to keep all the A rows, populated with NULL values from the B
> side, so that's what a LEFT OUTER is for.
>
> Now, when transforming that into a mapred job, my -naive-
> understanding would be that for every key on the A table, a missing
> key on the B table would be generated with a NULL value. If that were
> the case, I fail to understand why all NULL valued B keys would end up
> on the same reducer, since the key defines which reducer is used, not
> the value.
>
> So, obviously, this is not how it works.
>
> So my question is: how is this construct handled?
>
> Thanks a lot!
>
> D.Morel

Reply via email to