Hello All,

Thanks a lot for your responses. To clarify a few points:
I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of Hive, as we have to use the Cloudera distro only.

All records in the smaller table have at least one matching record in the larger table (there could be a few exceptions, but only a few). The join uses an ON clause. The query is something like:

select ... from
  ( (select ... from smaller_table)
    join
    (select ... from larger_table)
    on (smaller_table.col = larger_table.col) )

I will try setting mapred.child.java.opts -Xmx to a higher value and let you know. Is there a pattern or rule of thumb for when to add more nodes?

Thanks again!

On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong <sw...@netflix.com> wrote:
> In addition, put the smaller table on the left-hand side of a JOIN:
>
> SELECT ... FROM small_table JOIN large_table ON ...
>
>
> *From:* Bejoy Ks [mailto:bejoy...@yahoo.com]
> *Sent:* Wednesday, March 16, 2011 11:43 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Hadoop error 2 while joining two large tables
>
> Hey hadoop n00b,
> I second Mark's thought, but you can definitely try reframing your
> query to get things rolling. I'm not sure of your Hive query, but from
> my experience with joins on huge tables (record counts in the range of
> hundreds of millions), you should put the join conditions in the JOIN ON
> clause rather than specifying all conditions in WHERE.
>
> Say you have a query like this:
> SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b WHERE
> a.Column4 = b.Column1 AND a.Column2 = b.Column4 AND a.Column3 > b.Column2;
>
> You can reframe it as:
> SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b
> ON (a.Column4 = b.Column1 AND a.Column2 = b.Column4) WHERE a.Column3 >
> b.Column2;
>
> From my understanding, Hive supports only equijoins, so you can't put
> inequality conditions within JOIN ON; inequalities should go in WHERE.
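[Editor's note: the heap increase mentioned in the reply above can be tried per session from the Hive CLI before rerunning the join. This is only an illustrative sketch; the -Xmx value is an example, not a recommendation, and must fit within the memory actually available per task slot.]

```sql
-- Raise the heap for the map/reduce child JVMs for this session only.
-- 2048m is an example value; tune it to the RAM free per task slot.
SET mapred.child.java.opts=-Xmx2048m;

-- Then rerun the failing join in the same session.
```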
> This approach has worked for me when I encountered a situation similar to
> yours some time ago. Try it out, hope it helps.
>
> Regards
> Bejoy.K.S
>
> ------------------------------
>
> *From:* "Sunderlin, Mark" <mark.sunder...@teamaol.com>
> *To:* "user@hive.apache.org" <user@hive.apache.org>
> *Sent:* Wed, March 16, 2011 11:22:09 PM
> *Subject:* RE: Hadoop error 2 while joining two large tables
>
> hadoop n00b asks, “Is adding more nodes the solution to such a problem?”
>
> Whatever other answers you get, you should append “... and add more nodes.”
> More nodes is never a bad thing ;-)
>
> ---
> *Mark E. Sunderlin*
> *Solutions Architect | AOL Data Warehouse*
> P: 703-256-6935 | C: 540-327-6222
> AIM: MESunderlin
> 22000 AOL Way | Dulles, VA | 20166
>
>
> *From:* hadoop n00b [mailto:new2h...@gmail.com]
> *Sent:* Wednesday, March 16, 2011 3:33 AM
> *To:* user@hive.apache.org
> *Subject:* Fwd: Hadoop error 2 while joining two large tables
>
> Hello,
>
> I am trying to execute a query that joins two large tables (3 million and
> 20 million records). I am getting Hadoop error code 2 during execution.
> This happens mainly while the reducers are running. Sometimes the reducers
> complete 100% and then the error comes. The logs talk about running out of
> heap space and the GC overhead limit being exceeded.
>
> I am running a 6-node cluster with child JVM memory of 1 GB.
>
> Are there any parameters I could tweak to make the query run? Is adding
> more nodes the solution to such a problem?
>
> Thanks!
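[Editor's note: pulling the thread's suggestions together, the poster's query shape could be rewritten roughly as follows. Table and column names are placeholders, not the actual schema: the smaller table goes on the left-hand side, equality predicates go in JOIN ON, and any inequality stays in WHERE.]

```sql
-- Hypothetical rewrite; smaller_table, larger_table, col, other_col and
-- some_col are illustrative names only.
SELECT s.col, l.other_col
FROM smaller_table s            -- smaller table on the left-hand side
JOIN larger_table l
  ON (s.col = l.col)            -- equality predicates belong in ON (Hive equijoin)
WHERE l.some_col > s.some_col;  -- inequality conditions stay in WHERE
```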