Re: Reduce tasks running out of memory on small hadoop cluster
On 20-Sep-08, at 7:07 PM, Ryan LeCompte wrote:

> I'm setting up a small 3 node hadoop cluster (1 node for
> namenode/jobtracker and the other two for datanode/tasktracker). The
> map tasks finish fine, but the reduce tasks are failing at about 30%
> with an out of memory error. [...] Is there any way to get the reduce
> task to not try and hold all of the data in memory, or is my only
> option to add more nodes to the cluster to therefore increase the
> number of reduce tasks?

You can set the number of reduce tasks with a configuration option. More tasks means less input per task; since the number of concurrent tasks doesn't change, this should help you. I'd like to be able to set the number of concurrent tasks myself, but haven't noticed a way. In the end, I had to practice better design to reduce my memory footprint; sometimes one quick-and-dirty way to do this is to turn one job into a chain of jobs that each do less.

Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
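[Editor's note: the configuration option Karl refers to is, in Hadoop of this vintage, the `mapred.reduce.tasks` property (also settable per-job via `JobConf.setNumReduceTasks`). A sketch of the setting; the value 8 is illustrative, not from the thread:]

```xml
<!-- hadoop-site.xml: request more, smaller reduce tasks so each one
     handles less data. The value 8 is illustrative; a common rule of
     thumb is a small multiple of the cluster's total reduce slots. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```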
Re: Reduce tasks running out of memory on small hadoop cluster
I actually solved the problem by increasing a parameter in hadoop-site.xml, since the default wasn't sufficient:

  mapred.child.java.opts
  -Xmx1024m

Thanks,
Ryan

On Sun, Sep 21, 2008 at 12:59 AM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:

> Yes I did, but that didn't solve my problem since I'm working with a fairly
> large data set (8 GB).
>
> Thanks,
> Ryan
>
> On Sep 21, 2008, at 12:22 AM, Sandy <[EMAIL PROTECTED]> wrote:
>
>> Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
>> me some, but eventually I had to upgrade to a system with more memory.
>>
>> -SM
>> [...]
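[Editor's note: spelled out as a full configuration block, Ryan's fix looks like this in hadoop-site.xml; the property and value are the ones from the thread, and the heap should be sized to the worker machines' memory:]

```xml
<!-- hadoop-site.xml: per-task-child JVM options. The default heap for
     child task JVMs was small in Hadoop of this era (around -Xmx200m);
     this raises it to 1 GB for each map/reduce task JVM. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```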
Re: Reduce tasks running out of memory on small hadoop cluster
Yes I did, but that didn't solve my problem since I'm working with a fairly large data set (8 GB).

Thanks,
Ryan

On Sep 21, 2008, at 12:22 AM, Sandy <[EMAIL PROTECTED]> wrote:

> Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
> me some, but eventually I had to upgrade to a system with more memory.
>
> -SM
> [...]
Re: Reduce tasks running out of memory on small hadoop cluster
Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped me some, but eventually I had to upgrade to a system with more memory.

-SM

On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I'm setting up a small 3 node hadoop cluster (1 node for
> namenode/jobtracker and the other two for datanode/tasktracker). The
> map tasks finish fine, but the reduce tasks are failing at about 30%
> with an out of memory error. [...]
>
> Thanks!
>
> Ryan
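[Editor's note: the heapsize Sandy refers to is the HADOOP_HEAPSIZE variable in conf/hadoop-env.sh, specified in megabytes; a sketch:]

```shell
# conf/hadoop-env.sh: maximum heap, in MB, for the Hadoop daemon JVMs
# (namenode, jobtracker, datanode, tasktracker). The default was 1000.
export HADOOP_HEAPSIZE=2000
```

Note that this setting governs the daemon JVMs; the heap for each map/reduce child task is controlled separately by mapred.child.java.opts, which is consistent with raising it alone not curing the reduce-side out-of-memory errors in this thread.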
Reduce tasks running out of memory on small hadoop cluster
Hello all,

I'm setting up a small 3 node hadoop cluster (1 node for namenode/jobtracker and the other two for datanode/tasktracker). The map tasks finish fine, but the reduce tasks are failing at about 30% with an out of memory error. My guess is that the amount of data I'm crunching through just won't fit in memory during the reduce tasks on two machines (max of 2 reduce tasks on each machine). Is this expected?

If I had a larger hadoop cluster, I could increase the number of reduce tasks on each machine so that not all of the data is processed in just 4 JVMs on two machines, as I currently have it set up, correct? Is there any way to get the reduce tasks to not hold all of the data in memory, or is my only option to add more nodes to the cluster and thereby increase the number of reduce tasks?

Thanks!

Ryan