Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread David Morel
On 24 Jan 2013, at 20:39, bejoy...@yahoo.com wrote:

> Hi David,
>
> The default partitioner used in map reduce is the hash partitioner. So
> based on your keys they are send to a particular reducer.
>
> May be in your current data set, the keys that have no values in table
> are all falling in the same hash bucket and hence being processed by
> the same reducer.

Really not the case, no, so it doesn't make any sort of sense to me :\
At this point I am starting to think the only way to figure it out is to
set or add debugging at the reducer level. I hoped I could avoid it,
alas...

> If you are noticing a skew on a particular reducer, sometimes  a
> simple work around like increasing the no of reducers explicitly might
> help you get pass the hurdle.

Yes, that seemed to work in some cases, but is not very handy: it's
hard to tell my users 'try setting a different number of reducers when
your query is stuck' ;-)

> Also please ensure you have enabled skew join optimization.

Didn't have much success with this, but maybe my version of hive is a
bit old.

So to emulate a LEFT OUTER JOIN I had to do something really horrible:

JOIN (
-- TABLE B is the joined table
SELECT key, value FROM TABLE B
UNION ALL
-- TABLE A has all the unique ids 
SELECT key, '' FROM TABLE A 
) AS SUB

and then use a count -1 to get a count of non-null rows, etc. At least 
it does work with no slowdowns, but I mean, yuk!

It's the best I could come up with so far, so if I could fully understand
the root cause of the problem, that would be much better. I guess I'll 
dig in a bit deeper then.

Thanks a lot!

David

>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
>
> -Original Message-
> From: "David Morel" 
> Date: Thu, 24 Jan 2013 18:39:56
> To: ; 
> Reply-To: user@hive.apache.org
> Subject: Re: An explanation of LEFT OUTER JOIN and NULL values
>
> On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:
>
>> Hi David
>>
>> An explain extended would give you the exact pointer.
>>
>> From my understanding, this is how it could work.
>>
>> You have two tables then two different map reduce job would be
>> processing those. Based on the join keys, combination of corresponding
>> columns would be chosen as key from mapper1 and mapper2. So if the
>> combination of columns having the same value those records from two
>> set of mappers would go into the same reducer.
>>
>> On the reducer if there is a corresponding value for a key from table
>> 1 to  table 2/mapper 2 that value would be populated. If no val for
>> mapper 2 then those columns from table 2 are made null.
>>
>> If there is a key-value just from table 2/mapper 2 and no
>> corresponding value from mapper 1. That value is just discarded.
>
> Hi Bejoy,
>
> Thanks! So schematically, something like this, right?
>
> mapper1 (bigger table):
> K1-A, V1A
> K2-A, V2A
> K3-A, V3A
>
> mapper2 (joined, smaller table):
> K1-B, V1B
>
> reducer1:
> K1-A, V1A
> K1-B, V1B
>
> returns:
> K1, V1A, V1B etc
>
> reducer2:
> K2-A, V2A
> *no* K2-B, V so: K2-B, NULL is created, same for next row.
> K3-A, V3A
>
> returns:
> K2, V2A, NULL etc
> K3, V3A, NULL etc
>
> I still don't understand why my reducer2 (and only this one, which
> apparently gets all the keys for which we don't have a row on table B)
> would become overloaded. Am I completely misunderstanding the whole
> thing?
>
> David
>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>>
>> -Original Message-
>> From: "David Morel" 
>> Date: Thu, 24 Jan 2013 18:03:40
>> To: user@hive.apache.org
>> Reply-To: user@hive.apache.org
>> Subject: An explanation of LEFT OUTER JOIN and NULL values
>>
>> Hi!
>>
>> After hitting the "curse of the last reducer" many times on LEFT OUTER
>> JOIN queries, and trying to think about it, I came to the conclusion
>> there's something I am missing regarding how keys are handled in
>> mapred jobs.
>>
>> The problem shows when I have table A containing billions of rows with
>> distinctive keys, that I need to join to table B that has a much lower
>> number of rows.
>>
>> I need to keep all the A rows, populated with NULL values from the B
>> side, so that's what a LEFT OUTER is for.
>>
>> Now, when transforming that into a mapred job, my -naive-
>> understanding would be that for every key on the A table, a missing
>> key on the B table would be generated with a NULL value. If that were
>> the case, I fail to understand why all NULL valued B keys would end up
>> on the same reducer, since the key defines which reducer is used, not
>> the value.
>>
>> So, obviously, this is not how it works.
>>
>> So my question is: how is this construct handled?
>>
>> Thanks a lot!
>>
>> D.Morel



Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread bejoy_ks
Hi David,

The default partitioner used in map reduce is the hash partitioner. So based on 
your keys they are send to a particular reducer.

May be in your current data set, the keys that have no values in table are all 
falling in the same hash bucket and hence being processed by the same reducer.

If you are noticing a skew on a particular reducer, sometimes  a simple work 
around like increasing the no of reducers explicitly might help you get pass 
the hurdle.

Also please ensure you have enabled skew join optimization.
 
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: "David Morel" 
Date: Thu, 24 Jan 2013 18:39:56 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: An explanation of LEFT OUTER JOIN and NULL values

On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:

> Hi David
>
> An explain extended would give you the exact pointer.
>
> From my understanding, this is how it could work.
>
> You have two tables then two different map reduce job would be
> processing those. Based on the join keys, combination of corresponding
> columns would be chosen as key from mapper1 and mapper2. So if the
> combination of columns having the same value those records from two
> set of mappers would go into the same reducer.
>
> On the reducer if there is a corresponding value for a key from table
> 1 to  table 2/mapper 2 that value would be populated. If no val for
> mapper 2 then those columns from table 2 are made null.
>
> If there is a key-value just from table 2/mapper 2 and no
> corresponding value from mapper 1. That value is just discarded.

Hi Bejoy,

Thanks! So schematically, something like this, right?

mapper1 (bigger table):
K1-A, V1A
K2-A, V2A
K3-A, V3A

mapper2 (joined, smaller table):
K1-B, V1B

reducer1:
K1-A, V1A 
K1-B, V1B

returns:
K1, V1A, V1B etc

reducer2:
K2-A, V2A
*no* K2-B, V so: K2-B, NULL is created, same for next row.
K3-A, V3A

returns:
K2, V2A, NULL etc
K3, V3A, NULL etc

I still don't understand why my reducer2 (and only this one, which
apparently gets all the keys for which we don't have a row on table B)
would become overloaded. Am I completely misunderstanding the whole
thing?

David

> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
>
> -Original Message-
> From: "David Morel" 
> Date: Thu, 24 Jan 2013 18:03:40
> To: user@hive.apache.org
> Reply-To: user@hive.apache.org
> Subject: An explanation of LEFT OUTER JOIN and NULL values
>
> Hi!
>
> After hitting the "curse of the last reducer" many times on LEFT OUTER
> JOIN queries, and trying to think about it, I came to the conclusion
> there's something I am missing regarding how keys are handled in
> mapred jobs.
>
> The problem shows when I have table A containing billions of rows with
> distinctive keys, that I need to join to table B that has a much lower
> number of rows.
>
> I need to keep all the A rows, populated with NULL values from the B
> side, so that's what a LEFT OUTER is for.
>
> Now, when transforming that into a mapred job, my -naive-
> understanding would be that for every key on the A table, a missing
> key on the B table would be generated with a NULL value. If that were
> the case, I fail to understand why all NULL valued B keys would end up
> on the same reducer, since the key defines which reducer is used, not
> the value.
>
> So, obviously, this is not how it works.
>
> So my question is: how is this construct handled?
>
> Thanks a lot!
>
> D.Morel



Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread David Morel
On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:

> Hi David
>
> An explain extended would give you the exact pointer.
>
> From my understanding, this is how it could work.
>
> You have two tables then two different map reduce job would be
> processing those. Based on the join keys, combination of corresponding
> columns would be chosen as key from mapper1 and mapper2. So if the
> combination of columns having the same value those records from two
> set of mappers would go into the same reducer.
>
> On the reducer if there is a corresponding value for a key from table
> 1 to  table 2/mapper 2 that value would be populated. If no val for
> mapper 2 then those columns from table 2 are made null.
>
> If there is a key-value just from table 2/mapper 2 and no
> corresponding value from mapper 1. That value is just discarded.

Hi Bejoy,

Thanks! So schematically, something like this, right?

mapper1 (bigger table):
K1-A, V1A
K2-A, V2A
K3-A, V3A

mapper2 (joined, smaller table):
K1-B, V1B

reducer1:
K1-A, V1A 
K1-B, V1B

returns:
K1, V1A, V1B etc

reducer2:
K2-A, V2A
*no* K2-B, V so: K2-B, NULL is created, same for next row.
K3-A, V3A

returns:
K2, V2A, NULL etc
K3, V3A, NULL etc

I still don't understand why my reducer2 (and only this one, which
apparently gets all the keys for which we don't have a row on table B)
would become overloaded. Am I completely misunderstanding the whole
thing?

David

> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
>
> -Original Message-
> From: "David Morel" 
> Date: Thu, 24 Jan 2013 18:03:40
> To: user@hive.apache.org
> Reply-To: user@hive.apache.org
> Subject: An explanation of LEFT OUTER JOIN and NULL values
>
> Hi!
>
> After hitting the "curse of the last reducer" many times on LEFT OUTER
> JOIN queries, and trying to think about it, I came to the conclusion
> there's something I am missing regarding how keys are handled in
> mapred jobs.
>
> The problem shows when I have table A containing billions of rows with
> distinctive keys, that I need to join to table B that has a much lower
> number of rows.
>
> I need to keep all the A rows, populated with NULL values from the B
> side, so that's what a LEFT OUTER is for.
>
> Now, when transforming that into a mapred job, my -naive-
> understanding would be that for every key on the A table, a missing
> key on the B table would be generated with a NULL value. If that were
> the case, I fail to understand why all NULL valued B keys would end up
> on the same reducer, since the key defines which reducer is used, not
> the value.
>
> So, obviously, this is not how it works.
>
> So my question is: how is this construct handled?
>
> Thanks a lot!
>
> D.Morel



Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread bejoy_ks
Hi David

An explain extended would give you the exact pointer.

From my understanding, this is how it could work. 

 You have two tables then two different map reduce job would be processing 
those. Based on the join keys, combination of corresponding columns would be 
chosen as key from mapper1 and mapper2. So if the combination of columns having 
the same value those records from two set of mappers would go into the same 
reducer.

On the reducer if there is a corresponding value for a key from table 1 to  
table 2/mapper 2 that value would be populated. If no val for mapper 2 then 
those columns from table 2 are made null.

If there is a key-value just from table 2/mapper 2 and no  corresponding value 
from mapper 1. That value is just discarded.


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: "David Morel" 
Date: Thu, 24 Jan 2013 18:03:40 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: An explanation of LEFT OUTER JOIN and NULL values

Hi!

After hitting the "curse of the last reducer" many times on LEFT OUTER
JOIN queries, and trying to think about it, I came to the conclusion
there's something I am missing regarding how keys are handled in mapred
jobs.

The problem shows when I have table A containing billions of rows with
distinctive keys, that I need to join to table B that has a much lower
number of rows.

I need to keep all the A rows, populated with NULL values from the B
side, so that's what a LEFT OUTER is for.

Now, when transforming that into a mapred job, my -naive- understanding
would be that for every key on the A table, a missing key on the B table
would be generated with a NULL value. If that were the case, I fail to
understand why all NULL valued B keys would end up on the same reducer,
since the key defines which reducer is used, not the value.

So, obviously, this is not how it works.

So my question is: how is this construct handled?

Thanks a lot!

D.Morel


An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread David Morel
Hi!

After hitting the "curse of the last reducer" many times on LEFT OUTER
JOIN queries, and trying to think about it, I came to the conclusion
there's something I am missing regarding how keys are handled in mapred
jobs.

The problem shows when I have table A containing billions of rows with
distinctive keys, that I need to join to table B that has a much lower
number of rows.

I need to keep all the A rows, populated with NULL values from the B
side, so that's what a LEFT OUTER is for.

Now, when transforming that into a mapred job, my -naive- understanding
would be that for every key on the A table, a missing key on the B table
would be generated with a NULL value. If that were the case, I fail to
understand why all NULL valued B keys would end up on the same reducer,
since the key defines which reducer is used, not the value.

So, obviously, this is not how it works.

So my question is: how is this construct handled?

Thanks a lot!

D.Morel