On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote: > Hi David > > An explain extended would give you the exact pointer. > > From my understanding, this is how it could work. > > You have two tables then two different map reduce job would be > processing those. Based on the join keys, combination of corresponding > columns would be chosen as key from mapper1 and mapper2. So if the > combination of columns having the same value those records from two > set of mappers would go into the same reducer. > > On the reducer if there is a corresponding value for a key from table > 1 to table 2/mapper 2 that value would be populated. If no val for > mapper 2 then those columns from table 2 are made null. > > If there is a key-value just from table 2/mapper 2 and no > corresponding value from mapper 1. That value is just discarded.
Hi Bejoy, Thanks! So schematically, something like this, right? mapper1 (bigger table): K1-A, V1A K2-A, V2A K3-A, V3A mapper2 (joined, smaller table): K1-B, V1B reducer1: K1-A, V1A K1-B, V1B returns: K1, V1A, V1B etc reducer2: K2-A, V2A *no* K2-B, V so: K2-B, NULL is created, same for next row. K3-A, V3A returns: K2, V2A, NULL etc K3, V3A, NULL etc I still don't understand why my reducer2 (and only this one, which apparently gets all the keys for which we don't have a row on table B) would become overloaded. Am I completely misunderstanding the whole thing? David > Regards > Bejoy KS > > Sent from remote device, Please excuse typos > > -----Original Message----- > From: "David Morel" <dmore...@gmail.com> > Date: Thu, 24 Jan 2013 18:03:40 > To: user@hive.apache.org<user@hive.apache.org> > Reply-To: user@hive.apache.org > Subject: An explanation of LEFT OUTER JOIN and NULL values > > Hi! > > After hitting the "curse of the last reducer" many times on LEFT OUTER > JOIN queries, and trying to think about it, I came to the conclusion > there's something I am missing regarding how keys are handled in > mapred jobs. > > The problem shows when I have table A containing billions of rows with > distinctive keys, that I need to join to table B that has a much lower > number of rows. > > I need to keep all the A rows, populated with NULL values from the B > side, so that's what a LEFT OUTER is for. > > Now, when transforming that into a mapred job, my -naive- > understanding would be that for every key on the A table, a missing > key on the B table would be generated with a NULL value. If that were > the case, I fail to understand why all NULL valued B keys would end up > on the same reducer, since the key defines which reducer is used, not > the value. > > So, obviously, this is not how it works. > > So my question is: how is this construct handled? > > Thanks a lot! > > D.Morel