Rohan,

The short answer is: I don't know :-) If you could paste the log, I or someone else on the mailing list might be able to help.
BTW, what version of Hive were you using? Did you set the threshold before running the query? Try to find some documentation online that says which properties need to be set before a skew join. My understanding was that the 2 properties I mentioned below should suffice.

Mark

----- Original Message -----
From: "rohan monga" <monga.ro...@gmail.com>
To: user@hive.apache.org
Cc: "Ayon Sinha" <ayonsi...@yahoo.com>
Sent: Thursday, November 17, 2011 4:44:17 PM
Subject: Re: Severely hit by "curse of last reducer"

Hi Mark,
I have tried setting hive.optimize.skewjoin=true, but I get a NullPointerException after the first stage of the query completes. Why does that happen?

Thanks,
--
Rohan Monga

On Thu, Nov 17, 2011 at 1:37 PM, Mark Grover <mgro...@oanda.com> wrote:
> Ayon,
> I see. From what you explained, skew join seems like what you want. Have you
> tried that already?
>
> Details on how skew join works are in this presentation. Jump to the 15 minute
> mark if you just want to hear about skew joins.
> http://www.youtube.com/watch?v=OB4H3Yt5VWM
>
> I bet you could also find something in the mailing list archives related to
> skew join.
>
> In a nutshell (from the video),
> set hive.optimize.skewjoin=true
> set hive.skewjoin.key=<Threshold>
>
> should do the trick for you. Threshold, I believe, is the number of rows per
> join key above which the key is treated as skewed and deferred till later.
>
> Good luck!
> Mark
>
> ----- Original Message -----
> From: "Ayon Sinha" <ayonsi...@yahoo.com>
> To: "Mark Grover" <mgro...@oanda.com>, user@hive.apache.org
> Sent: Wednesday, November 16, 2011 10:53:19 PM
> Subject: Re: Severely hit by "curse of last reducer"
>
> It is always only one reducer that is stuck. My table2 is small, but using a
> Mapjoin makes my mappers run out of memory. My max reducers is 32 (also max
> reduce capacity). I tried setting num reducers to a higher number (even 6000,
> which is approx. the number of combinations of dates & names I have) only to
> have lots of reducers with no data.
> So I am quite sure it is some key in stage-1 that is doing this.
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
>
> From: Mark Grover <mgro...@oanda.com>
> To: user@hive.apache.org; Ayon Sinha <ayonsi...@yahoo.com>
> Sent: Wednesday, November 16, 2011 6:54 PM
> Subject: Re: Severely hit by "curse of last reducer"
>
> Hi Ayon,
> Is it one particular reduce task that is slow or the entire reduce phase? How
> many reduce tasks did you have, anyway?
>
> Looking into what the reducer key was might only make sense if a particular
> reduce task was slow.
>
> If your table2 is small enough to fit in memory, you might want to try a map
> join.
> More details at:
> http://www.facebook.com/note.php?note_id=470667928919
>
> Let me know what you find.
>
> Mark
>
> ----- Original Message -----
> From: "Ayon Sinha" <ayonsi...@yahoo.com>
> To: "Hive Mailinglist" <user@hive.apache.org>
> Sent: Wednesday, November 16, 2011 9:03:23 PM
> Subject: Severely hit by "curse of last reducer"
>
> Hi,
> Where do I find the log of what reducer key is causing the last reducer to go
> on for hours? The reducer logs don't say much about the key it's processing.
> Is there a way to enable a debug mode where it would log the key it's
> processing?
>
> My query looks like:
>
> select partner_name, dates, sum(coins_granted) from table1 u join table2 p on
> u.partner_id=p.id group by partner_name, dates
>
> The uncompressed size of table1 is about 30GB.
>
> -Ayon
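
On the original question of which key is keeping the last reducer busy: one way to check, without any reducer debug logging, is to count rows per join key in the large table. This is only a sketch. It assumes the skew is on the join key from the query above (u.partner_id), and the LIMIT of 20 and the threshold of 100000 are illustrative values, not tuned recommendations.

```sql
-- Count rows per join key in the large table. Keys with far more rows
-- than the rest are the likely cause of the single slow reducer.
SELECT partner_id, COUNT(*) AS row_count
FROM table1
GROUP BY partner_id
ORDER BY row_count DESC
LIMIT 20;

-- The two skew join settings mentioned in the thread.
-- 100000 is an assumed threshold (rows per key); adjust to your data.
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;
```

If one partner_id (possibly NULL) accounts for most of the rows, that key is the one to look at, and its row count gives a starting point for choosing the threshold.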