Re: Reduce tasks running out of memory on small hadoop cluster

2008-09-20 Thread Ryan LeCompte
Yes I did, but that didn't solve my problem, since I'm working with a
fairly large data set (8 GB).


Thanks,
Ryan




On Sep 21, 2008, at 12:22 AM, Sandy <[EMAIL PROTECTED]> wrote:

Have you increased the heapsize in conf/hadoop-env.sh to 2000? This
helped me some, but eventually I had to upgrade to a system with more
memory.

-SM


On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <[EMAIL PROTECTED]>  
wrote:



Hello all,

I'm setting up a small 3-node Hadoop cluster (1 node for the
namenode/jobtracker and the other two for datanode/tasktracker). The
map tasks finish fine, but the reduce tasks are failing at about 30%
with an out-of-memory error. My guess is that the amount of data I'm
crunching through just won't fit in memory during the reduce tasks on
two machines (max of 2 reduce tasks on each machine). Is this
expected? If I had a larger Hadoop cluster, I could increase the
number of reduce tasks on each machine so that not all of the data is
being processed in just 4 JVMs on two machines, as in my current
setup, correct? Is there any way to get the reduce tasks to not try
to hold all of the data in memory, or is my only option to add more
nodes to the cluster and thereby increase the number of reduce tasks?

Thanks!

Ryan



Re: Reduce tasks running out of memory on small hadoop cluster

2008-09-20 Thread Sandy
Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
me some, but eventually I had to upgrade to a system with more memory.
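
For reference, this is the line I changed (a sketch; the value is in MB):

  # conf/hadoop-env.sh -- maximum heap, in MB, for the Hadoop daemons
  export HADOOP_HEAPSIZE=2000

Note that hadoop-env.sh sets the daemon heap. If it's the task JVMs
themselves that run out of memory, you may also need to raise
mapred.child.java.opts (default -Xmx200m) in your site config, e.g.:

  <!-- conf/hadoop-site.xml: max heap for each map/reduce task JVM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m</value>
  </property>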

-SM


On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I'm setting up a small 3-node Hadoop cluster (1 node for the
> namenode/jobtracker and the other two for datanode/tasktracker). The
> map tasks finish fine, but the reduce tasks are failing at about 30%
> with an out-of-memory error. My guess is that the amount of data I'm
> crunching through just won't fit in memory during the reduce tasks on
> two machines (max of 2 reduce tasks on each machine). Is this
> expected? If I had a larger Hadoop cluster, I could increase the
> number of reduce tasks on each machine so that not all of the data is
> being processed in just 4 JVMs on two machines, as in my current
> setup, correct? Is there any way to get the reduce tasks to not try
> to hold all of the data in memory, or is my only option to add more
> nodes to the cluster and thereby increase the number of reduce tasks?
>
> Thanks!
>
> Ryan
>


Reduce tasks running out of memory on small hadoop cluster

2008-09-20 Thread Ryan LeCompte
Hello all,

I'm setting up a small 3-node Hadoop cluster (1 node for the
namenode/jobtracker and the other two for datanode/tasktracker). The
map tasks finish fine, but the reduce tasks are failing at about 30%
with an out-of-memory error. My guess is that the amount of data I'm
crunching through just won't fit in memory during the reduce tasks on
two machines (max of 2 reduce tasks on each machine). Is this
expected? If I had a larger Hadoop cluster, I could increase the
number of reduce tasks on each machine so that not all of the data is
being processed in just 4 JVMs on two machines, as in my current
setup, correct? Is there any way to get the reduce tasks to not try
to hold all of the data in memory, or is my only option to add more
nodes to the cluster and thereby increase the number of reduce tasks?

Thanks!

Ryan


Re: Is pipes really supposed to be a serious API? Is it being actively developed?

2008-09-20 Thread Owen O'Malley
I'm sorry that your questions haven't been answered. Pipes is used
extensively at Yahoo to build the Webmap, which is the graph of the
entire web. You can now set counters in Pipes, but mostly it has just
been working for Yahoo. It isn't expected to perform better than the
Java API, because all of the framework code is the same Java code. It
seems to perform very close to the Java API. If you have patches to
make it more portable, that would be great.


-- Owen

On Sep 19, 2008, at 11:10 AM, Marc Vaillant <[EMAIL PROTECTED]>  
wrote:



Only about 5 Pipes/C++-related posts since mid-July, and basically no
responses. Is anyone really using or actively developing Pipes? We've
invested some time to make it platform-independent (ported BSD sockets
to Boost sockets, and the XDR serialization to Boost serialization),
but it's still lacking an efficient way to submit jobs, a more
efficient mechanism for executing map/reduce without launching Pipes
executables, etc.

Hello, anyone out there?

Marc



Re: Tips on sorting using Hadoop

2008-09-20 Thread Owen O'Malley
On Sat, Sep 20, 2008 at 11:12 AM, lohit <[EMAIL PROTECTED]> wrote:

> To do a total-order sort, you have to make your partition function split
> the keyspace in order, equally among the reducers.


A library to do this was checked in yesterday. See HADOOP-3019.
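
Rough usage looks something like this (a sketch against the
org.apache.hadoop.mapred API; the class names are from HADOOP-3019, while
the sampler parameters, paths, and the SequenceFile-with-Text-keys input
are placeholder assumptions):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileInputFormat;
  import org.apache.hadoop.mapred.lib.InputSampler;
  import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

  public class TotalOrderSort {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(TotalOrderSort.class);
      conf.setJobName("total-order-sort");
      conf.setInputFormat(SequenceFileInputFormat.class);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      conf.setNumReduceTasks(8);
      conf.setPartitionerClass(TotalOrderPartitioner.class);

      // Sample the input to pick split points, then hand the resulting
      // partition file to the partitioner. (Depending on your setup you
      // may need to ship this file to the tasks via the DistributedCache;
      // the TeraSort example shows how.)
      TotalOrderPartitioner.setPartitionFile(conf,
          new Path(args[1] + "_partitions"));
      InputSampler.writePartitionFile(conf,
          new InputSampler.RandomSampler<Text, Text>(0.01, 1000, 10));

      // Identity map/reduce: the framework's per-reducer sort plus this
      // partitioner yield globally ordered output across all reducers.
      JobClient.runJob(conf);
    }
  }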

-- Owen


Re: Tips on sorting using Hadoop

2008-09-20 Thread lohit
Since this is sorting, does it help to run map/reduce twice? The number
of output bytes should be the same as the number of input bytes.
To do a total-order sort, you have to make your partition function split
the keyspace in order, equally among the reducers.
For an example of how this is done, look at TeraSort:
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/terasort/TeraSort.java
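
As a toy illustration of the idea (not TeraSort's actual sampling
scheme; this just assumes lowercase Text keys spread roughly evenly
over 'a'..'z'):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Carve the key space into contiguous ranges, one per reducer, so
  // every key in partition i sorts before every key in partition i+1;
  // the concatenated reduce outputs then form one globally sorted set.
  public class FirstLetterPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}

    public int getPartition(Text key, Text value, int numPartitions) {
      char first = Character.toLowerCase(key.toString().charAt(0));
      int bucket = (first - 'a') * numPartitions / 26;
      // Clamp non-alphabetic keys into a valid partition.
      return Math.min(Math.max(bucket, 0), numPartitions - 1);
    }
  }

You would then set it on the job with
conf.setPartitionerClass(FirstLetterPartitioner.class).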

Thanks,
Lohit



- Original Message 
From: Edward J. Yoon <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Saturday, September 20, 2008 10:53:40 AM
Subject: Re: Tips on sorting using Hadoop

I would recommend running map/reduce twice.

/Edward

On Sat, Sep 13, 2008 at 5:58 AM, Tenaali Ram <[EMAIL PROTECTED]> wrote:
> Hi,
> I want to sort my records (consisting of string, int, float) using Hadoop.
>
> One way I have found is to set the number of reducers to 1, but this
> would mean all the records go to 1 reducer and it won't be optimized.
> Can anyone point me to a better way to do sorting using Hadoop?
>
> Thanks,
> Tenaali
>



-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org



Re: Tips on sorting using Hadoop

2008-09-20 Thread Edward J. Yoon
I would recommend running map/reduce twice.

/Edward

On Sat, Sep 13, 2008 at 5:58 AM, Tenaali Ram <[EMAIL PROTECTED]> wrote:
> Hi,
> I want to sort my records (consisting of string, int, float) using Hadoop.
>
> One way I have found is to set the number of reducers to 1, but this
> would mean all the records go to 1 reducer and it won't be optimized.
> Can anyone point me to a better way to do sorting using Hadoop?
>
> Thanks,
> Tenaali
>



-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Re: Data Transfer mechanism between different clusters

2008-09-20 Thread Edward J. Yoon
See http://hadoop.apache.org/core/docs/current/distcp.html

AFAIK, when copying between different clusters, distcp uses the hftp
protocol.
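
For example, to copy from a remote cluster by reading through its
namenode's HTTP interface (a sketch: the host names are placeholders,
and 50070 is the default dfs.http.address port):

  hadoop distcp hftp://nn1.example.com:50070/src/dir \
      hdfs://nn2.example.com/dest/dir

Since hftp is read-only, run this on the destination cluster so the
writes go through its own hdfs.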

/Edward

On Sat, Sep 20, 2008 at 5:43 PM, Wasim Bari <[EMAIL PROTECTED]> wrote:
>
> Hello All,
> What kind of support does Hadoop provide for transferring data between
> clusters residing in different geographical locations (possibly over a
> WAN)? Is there a fast and efficient method available (like GridFTP in
> Globus)?
>
> Thanks,
>
> Wasim
>



-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Data Transfer mechanism between different clusters

2008-09-20 Thread Wasim Bari


Hello All,
What kind of support does Hadoop provide for transferring data between
clusters residing in different geographical locations (possibly over a
WAN)?
Is there a fast and efficient method available (like GridFTP in Globus)?


Thanks,

Wasim