Hello All,

Thanks a lot for your responses. To clarify a few points:
I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of Hive, as we have to use the Cloudera distro only.

All records in the smaller table have at least one matching record in the larger table (there could be a few exceptions, but only a few). The join uses an ON clause. The query is something like:

select ... from
  ( (select ... from smaller_table)
    join
    (select ... from larger_table)
    on (smaller_table.col = larger_table.col) )

I will try setting mapred.child.java.opts -Xmx to a higher value and let you know. Is there a pattern or rule of thumb for when to add more nodes?

Thanks again!

On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong <sw...@netflix.com> wrote:
> In addition, put the smaller table on the left-hand side of a JOIN:
>
> SELECT ... FROM small_table JOIN large_table ON ...
>
>
> *From:* Bejoy Ks [mailto:bejoy...@yahoo.com]
> *Sent:* Wednesday, March 16, 2011 11:43 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Hadoop error 2 while joining two large tables
>
> Hey hadoop n00b,
> I second Mark's thought, but you can definitely try reframing your
> query to get things rolling. I'm not sure of your Hive query, but from
> my experience with joins on huge tables (record counts in the range of
> hundreds of millions), you should put the join conditions in the JOIN ON
> clause rather than specifying all conditions in WHERE.
>
> Say you have a query like this:
> SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b WHERE
> a.Column4 = b.Column1 AND a.Column2 = b.Column4 AND a.Column3 > b.Column2;
>
> You can reframe it as:
> SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b
> ON (a.Column4 = b.Column1 AND a.Column2 = b.Column4) WHERE a.Column3 >
> b.Column2;
>
> From my understanding, Hive supports only equijoins, so you can't put
> inequality conditions within JOIN ON; inequalities should go in WHERE.
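[Editor's note: the heap increase mentioned in the reply above can be tried per session from the Hive CLI before rerunning the join. This is only an illustrative sketch; the -Xmx value is an example, not a recommendation, and must fit within the memory actually available per task slot.]

```sql
-- Raise the heap for the map/reduce child JVMs for this session only.
-- 2048m is an example value; tune it to the RAM free per task slot.
SET mapred.child.java.opts=-Xmx2048m;

-- Then rerun the failing join in the same session.
```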
> This approach has worked for me when I encountered a situation similar to
> yours some time ago. Try it out, hope it helps.
>
> Regards
> Bejoy.K.S
>
> ------------------------------
>
> *From:* "Sunderlin, Mark" <mark.sunder...@teamaol.com>
> *To:* "user@hive.apache.org" <user@hive.apache.org>
> *Sent:* Wed, March 16, 2011 11:22:09 PM
> *Subject:* RE: Hadoop error 2 while joining two large tables
>
> hadoop n00b asks, “Is adding more nodes the solution to such a problem?”
>
> Whatever other answers you get, you should append “... and add more nodes.”
> More nodes is never a bad thing ;-)
>
> ---
> *Mark E. Sunderlin*
> *Solutions Architect | AOL Data Warehouse*
> P: 703-256-6935 | C: 540-327-6222
> AIM: MESunderlin
> 22000 AOL Way | Dulles, VA | 20166
>
>
> *From:* hadoop n00b [mailto:new2h...@gmail.com]
> *Sent:* Wednesday, March 16, 2011 3:33 AM
> *To:* user@hive.apache.org
> *Subject:* Fwd: Hadoop error 2 while joining two large tables
>
> Hello,
>
> I am trying to execute a query that joins two large tables (3 million and
> 20 million records). I am getting Hadoop error code 2 during execution.
> This happens mainly while the reducers are running. Sometimes the reducers
> complete 100% and then the error comes. The logs talk about running out of
> heap space and the GC overhead limit being exceeded.
>
> I am running a 6-node cluster with child JVM memory of 1 GB.
>
> Are there any parameters I could tweak to make the query run? Is adding
> more nodes the solution to such a problem?
>
> Thanks!
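[Editor's note: pulling the thread's suggestions together, the poster's query shape could be rewritten roughly as follows. Table and column names are placeholders, not the actual schema: the smaller table goes on the left-hand side, equality predicates go in JOIN ON, and any inequality stays in WHERE.]

```sql
-- Hypothetical rewrite; smaller_table, larger_table, col, other_col and
-- some_col are illustrative names only.
SELECT s.col, l.other_col
FROM smaller_table s            -- smaller table on the left-hand side
JOIN larger_table l
  ON (s.col = l.col)            -- equality predicates belong in ON (Hive equijoin)
WHERE l.some_col > s.some_col;  -- inequality conditions stay in WHERE
```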