Re: Reduce tasks running out of memory on small hadoop cluster
Yes I did, but that didn't solve my problem since I'm working with a fairly large data set (8 GB).

Thanks,
Ryan

On Sep 21, 2008, at 12:22 AM, Sandy <[EMAIL PROTECTED]> wrote:

> Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
> me some, but eventually I had to upgrade to a system with more memory.
>
> -SM
Re: Reduce tasks running out of memory on small hadoop cluster
Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped me some, but eventually I had to upgrade to a system with more memory.

-SM

On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I'm setting up a small 3 node hadoop cluster (1 node for
> namenode/jobtracker and the other two for datanode/tasktracker). The
> map tasks finish fine, but the reduce tasks are failing at about 30%
> with an out of memory error. [...]
>
> Ryan
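For reference, Sandy's suggestion corresponds to the HADOOP_HEAPSIZE variable in conf/hadoop-env.sh. A minimal sketch (the value 2000 is the one suggested above; treat it as illustrative, not a recommendation):

```shell
# conf/hadoop-env.sh -- sizes the JVMs of the Hadoop daemons themselves
# (namenode, jobtracker, datanode, tasktracker). The value is in MB.
export HADOOP_HEAPSIZE=2000
```

Note that this setting does not size the JVMs that the map and reduce tasks actually run in; on Hadoop of this era those are controlled by the mapred.child.java.opts property in conf/hadoop-site.xml (e.g. a value of -Xmx512m), which is often the setting that matters for a reduce-side OutOfMemoryError.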
Reduce tasks running out of memory on small hadoop cluster
Hello all,

I'm setting up a small 3-node Hadoop cluster (one node for the namenode/jobtracker and the other two for datanode/tasktracker). The map tasks finish fine, but the reduce tasks are failing at about 30% with an out-of-memory error. My guess is that the amount of data I'm crunching through simply won't fit in memory during the reduce tasks on two machines (a maximum of 2 reduce tasks on each machine). Is this expected? If I had a larger Hadoop cluster, I could increase the number of reduce tasks so that the data isn't all being processed in just the 4 JVMs on two machines that I currently have set up, correct? Is there any way to get the reduce task not to try to hold all of the data in memory, or is my only option to add more nodes to the cluster and thereby increase the number of reduce tasks?

Thanks!

Ryan
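On Ryan's last question: a reducer does not have to hold all of a key's values in memory, because the framework hands them to the reduce call as an iterator. The sketch below is plain Java with hypothetical method names (no Hadoop dependency, so it stays self-contained) contrasting the two styles; the buffering style is the common cause of this kind of OutOfMemoryError:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch: Hadoop's reduce callback receives an Iterator over
// the values for a key. Consuming it one element at a time keeps memory
// flat; copying the values into a collection first costs O(n) memory.
public class StreamingReduceSketch {

    // Streaming style: constant memory regardless of how many values arrive.
    static long streamingSum(Iterator<Long> values) {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next();   // process and discard each value
        }
        return sum;
    }

    // Buffering style (anti-pattern for large keys): the whole value set
    // is held in memory at once before any work is done.
    static long bufferingSum(Iterator<Long> values) {
        List<Long> all = new ArrayList<>();
        while (values.hasNext()) {
            all.add(values.next());
        }
        long sum = 0;
        for (long v : all) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(3L, 4L, 5L);
        System.out.println(streamingSum(values.iterator())); // 12
        System.out.println(bufferingSum(values.iterator())); // 12
    }
}
```

If the reduce function itself streams like this and the job still runs out of memory, the pressure is more likely in the shuffle/merge phase, where a larger task heap or more reducers (i.e. more nodes) is the usual remedy.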
Re: Is pipes really supposed to be a serious API? Is it being actively developed?
I'm sorry that your questions haven't been answered. Pipes is used extensively at Yahoo to build the Webmap, which is the graph of the entire web. You can now set counters in Pipes, but mostly it has just been working for Yahoo. It isn't expected to perform better than the Java API, because all of the framework code is the same Java; in practice it performs very close to it. If you have patches to make it more portable, that would be great.

-- Owen

On Sep 19, 2008, at 11:10 AM, Marc Vaillant <[EMAIL PROTECTED]> wrote:

> Only about 5 pipes/C++ related posts since mid-July, and basically no
> responses. Is anyone really using or actively developing Pipes? We've
> invested some time to make it platform independent (ported BSD sockets to
> Boost sockets, and the XDR serialization to Boost serialization), but it's
> still lacking an efficient way to submit jobs, a more efficient mechanism
> for executing map/reduce without launching pipes executables, etc.
>
> Hello, anyone out there?
>
> Marc
Re: Tips on sorting using Hadoop
On Sat, Sep 20, 2008 at 11:12 AM, lohit <[EMAIL PROTECTED]> wrote:

> To do total order sorting, you have to make your partition function split
> the keyspace equally, in order, among the number of reducers.

A library to do this was checked in yesterday. See HADOOP-3019.

-- Owen
Re: Tips on sorting using Hadoop
Since this is sorting, does it help if you run map/reduce twice? The number of output bytes should be the same as the number of input bytes. To do total order sorting, you have to make your partition function split the keyspace equally, in order, among the number of reducers. See TeraSort for an example of how this is done:

http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/terasort/TeraSort.java

Thanks,
Lohit

----- Original Message -----
From: Edward J. Yoon <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Saturday, September 20, 2008 10:53:40 AM
Subject: Re: Tips on sorting using Hadoop

I would recommend running map/reduce twice.

/Edward

On Sat, Sep 13, 2008 at 5:58 AM, Tenaali Ram <[EMAIL PROTECTED]> wrote:

> Hi,
> I want to sort my records (consisting of string, int, float) using
> Hadoop. [...]
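The partitioning trick Lohit describes can be sketched compactly. This is a simplified, hypothetical stand-in for what TeraSort (and the HADOOP-3019 library) does: sample the input, pick numReduces - 1 sorted cut points, and route each key to the reducer owning its key range, so that concatenating the reducer outputs in order yields a fully sorted result. The cut points here are illustrative:

```java
import java.util.Arrays;

// Simplified sketch of a total-order (range) partitioner: each reducer
// owns a contiguous slice of the keyspace, bounded by sorted cut points.
public class RangePartitionerSketch {
    private final String[] cutPoints; // sorted; length = numReduces - 1

    RangePartitionerSketch(String[] sortedCutPoints) {
        this.cutPoints = sortedCutPoints;
    }

    // Returns the reducer (partition) index for a key via binary search.
    int getPartition(String key) {
        int pos = Arrays.binarySearch(cutPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent;
        // keys equal to a cut point go to the partition above it.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        // 3 reducers, keyspace split at "h" and "p" (illustrative cut points)
        RangePartitionerSketch p = new RangePartitionerSketch(new String[]{"h", "p"});
        System.out.println(p.getPartition("apple")); // 0
        System.out.println(p.getPartition("mango")); // 1
        System.out.println(p.getPartition("zebra")); // 2
    }
}
```

In a real job the cut points come from sampling the input so that each reducer receives roughly the same amount of data; with evenly chosen cut points, no single reducer becomes the bottleneck the way a 1-reducer sort does.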
Re: Tips on sorting using Hadoop
I would recommend running map/reduce twice.

/Edward

On Sat, Sep 13, 2008 at 5:58 AM, Tenaali Ram <[EMAIL PROTECTED]> wrote:

> Hi,
> I want to sort my records (consisting of string, int, float) using Hadoop.
>
> One way I have found is to set the number of reducers to 1, but this would
> mean all the records go to 1 reducer and it won't be optimized. Can anyone
> point me to a better way to do sorting using Hadoop?
>
> Thanks,
> Tenaali

--
Best regards,
Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
Re: Data Transfer mechanism between different clusters
See http://hadoop.apache.org/core/docs/current/distcp.html

AFAIK, when copying between different clusters, distcp uses the hftp protocol.

/Edward

On Sat, Sep 20, 2008 at 5:43 PM, Wasim Bari <[EMAIL PROTECTED]> wrote:

> Hello All,
>
> What kind of support does Hadoop provide for data transfer between more
> than one cluster residing in different geographical locations (possibly
> over a WAN)? Is there any fast and efficient method available (like
> GridFTP in Globus)?
>
> Thanks,
>
> Wasim

--
Best regards,
Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
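For concreteness, distcp invocations along the lines the docs describe look like this (the hostnames and ports are hypothetical placeholders; 50070 is the default namenode web port that serves hftp):

```shell
# Copy /src from cluster A to /dest on cluster B. When both clusters run
# the same Hadoop version, hdfs:// URIs can be used on both sides:
hadoop distcp hdfs://namenodeA:9000/src hdfs://namenodeB:9000/dest

# When the clusters differ (e.g. different Hadoop versions), read the
# source through the read-only, HTTP-based hftp protocol and run the
# command on the destination cluster:
hadoop distcp hftp://namenodeA:50070/src hdfs://namenodeB:9000/dest
```

distcp runs as a MapReduce job itself, so the copy is parallelized across the cluster's task slots, which is what makes it reasonably fast over a WAN.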
Data Transfer mechanism between different clusters
Hello All,

What kind of support does Hadoop provide for data transfer between more than one cluster residing in different geographical locations (possibly over a WAN)? Is there any fast and efficient method available (like GridFTP in Globus)?

Thanks,

Wasim