Re: Reduce tasks running out of memory on small hadoop cluster

2008-09-22 Thread Karl Anderson


On 20-Sep-08, at 7:07 PM, Ryan LeCompte wrote:


Hello all,

I'm setting up a small 3-node Hadoop cluster (1 node for the
namenode/jobtracker and the other two for datanode/tasktracker). The
map tasks finish fine, but the reduce tasks are failing at about 30%
with an out-of-memory error. My guess is that the amount of data I'm
crunching through just won't fit in memory during the reduce tasks on
two machines (max of 2 reduce tasks per machine). Is this expected?
If I had a larger Hadoop cluster, I could spread the reduce tasks
across more machines so that all of the data isn't being processed in
just 4 JVMs on two machines, as I currently have it set up, correct?
Is there any way to get the reduce tasks not to try to hold all of
the data in memory, or is my only option to add more nodes to the
cluster and thereby increase the number of reduce tasks?


You can set the number of reduce tasks with a configuration option.
More tasks means less input per task; since the number of concurrent
tasks doesn't change, this should help you. I'd like to be able to
set the number of concurrent tasks myself, but I haven't noticed a
way to do so.
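
For example, something like the following in hadoop-site.xml should
raise the default number of reduces for your jobs (the value 8 is
just an illustration); JobConf.setNumReduceTasks() should do the same
thing per job:

  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>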


In the end, I had to rework my job design to reduce its memory
footprint; one quick-and-dirty way to do this is to turn one job into
a chain of jobs that each do less.
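
As a rough sketch of what I mean (jar, class, and path names are just
placeholders): run the first pass, then feed its output directory to
the second pass as its input:

  bin/hadoop jar mypasses.jar FirstPass raw-input intermediate
  bin/hadoop jar mypasses.jar SecondPass intermediate final-output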



Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra





Re: Reduce tasks running out of memory on small hadoop cluster

2008-09-21 Thread Ryan LeCompte
I actually solved the problem by increasing a parameter in
hadoop-site.xml, since the default wasn't sufficient:


<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
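
For what it's worth, if your job driver goes through
ToolRunner/GenericOptionsParser, the same setting can also be passed
per job on the command line instead of cluster-wide (jar, class, and
path names here are placeholders):

  bin/hadoop jar myjob.jar MyDriver -D mapred.child.java.opts=-Xmx1024m input output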


Thanks,
Ryan


On Sun, Sep 21, 2008 at 12:59 AM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Yes I did, but that didn't solve my problem since I'm working with a fairly
> large data set (8gb).
>
> Thanks,
> Ryan
>
>
>
>
> On Sep 21, 2008, at 12:22 AM, Sandy <[EMAIL PROTECTED]> wrote:
>
>> Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
>> me some, but eventually I had to upgrade to a system with more memory.
>>
>> -SM
>>
>>
>> On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>>
>>> Hello all,
>>>
>>> I'm setting up a small 3-node Hadoop cluster (1 node for the
>>> namenode/jobtracker and the other two for datanode/tasktracker). The
>>> map tasks finish fine, but the reduce tasks are failing at about 30%
>>> with an out-of-memory error. My guess is that the amount of data I'm
>>> crunching through just won't fit in memory during the reduce tasks on
>>> two machines (max of 2 reduce tasks per machine). Is this expected?
>>> If I had a larger Hadoop cluster, I could spread the reduce tasks
>>> across more machines so that all of the data isn't being processed in
>>> just 4 JVMs on two machines, as I currently have it set up, correct?
>>> Is there any way to get the reduce tasks not to try to hold all of
>>> the data in memory, or is my only option to add more nodes to the
>>> cluster and thereby increase the number of reduce tasks?
>>>
>>> Thanks!
>>>
>>> Ryan
>>>
>


Re: Reduce tasks running out of memory on small hadoop cluster

2008-09-20 Thread Ryan LeCompte
Yes I did, but that didn't solve my problem since I'm working with a  
fairly large data set (8gb).


Thanks,
Ryan




On Sep 21, 2008, at 12:22 AM, Sandy <[EMAIL PROTECTED]> wrote:

Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
me some, but eventually I had to upgrade to a system with more memory.

-SM


On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:



Hello all,

I'm setting up a small 3-node Hadoop cluster (1 node for the
namenode/jobtracker and the other two for datanode/tasktracker). The
map tasks finish fine, but the reduce tasks are failing at about 30%
with an out-of-memory error. My guess is that the amount of data I'm
crunching through just won't fit in memory during the reduce tasks on
two machines (max of 2 reduce tasks per machine). Is this expected?
If I had a larger Hadoop cluster, I could spread the reduce tasks
across more machines so that all of the data isn't being processed in
just 4 JVMs on two machines, as I currently have it set up, correct?
Is there any way to get the reduce tasks not to try to hold all of
the data in memory, or is my only option to add more nodes to the
cluster and thereby increase the number of reduce tasks?

Thanks!

Ryan



Re: Reduce tasks running out of memory on small hadoop cluster

2008-09-20 Thread Sandy
Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
me some, but eventually I had to upgrade to a system with more memory.
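
In hadoop-env.sh that would be the HADOOP_HEAPSIZE setting (in MB), e.g.:

  export HADOOP_HEAPSIZE=2000

As far as I know that sets the heap for the Hadoop daemons themselves;
the child task JVMs get their heap from mapred.child.java.opts instead.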

-SM


On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I'm setting up a small 3-node Hadoop cluster (1 node for the
> namenode/jobtracker and the other two for datanode/tasktracker). The
> map tasks finish fine, but the reduce tasks are failing at about 30%
> with an out-of-memory error. My guess is that the amount of data I'm
> crunching through just won't fit in memory during the reduce tasks on
> two machines (max of 2 reduce tasks per machine). Is this expected?
> If I had a larger Hadoop cluster, I could spread the reduce tasks
> across more machines so that all of the data isn't being processed in
> just 4 JVMs on two machines, as I currently have it set up, correct?
> Is there any way to get the reduce tasks not to try to hold all of
> the data in memory, or is my only option to add more nodes to the
> cluster and thereby increase the number of reduce tasks?
>
> Thanks!
>
> Ryan
>


Reduce tasks running out of memory on small hadoop cluster

2008-09-20 Thread Ryan LeCompte
Hello all,

I'm setting up a small 3-node Hadoop cluster (1 node for the
namenode/jobtracker and the other two for datanode/tasktracker). The
map tasks finish fine, but the reduce tasks are failing at about 30%
with an out-of-memory error. My guess is that the amount of data I'm
crunching through just won't fit in memory during the reduce tasks on
two machines (max of 2 reduce tasks per machine). Is this expected?
If I had a larger Hadoop cluster, I could spread the reduce tasks
across more machines so that all of the data isn't being processed in
just 4 JVMs on two machines, as I currently have it set up, correct?
Is there any way to get the reduce tasks not to try to hold all of
the data in memory, or is my only option to add more nodes to the
cluster and thereby increase the number of reduce tasks?

Thanks!

Ryan