Hadoop not utilizing max reducer capacity + reducer stuck in pending state

2010-08-18 Thread Tarandeep Singh
Hi,

I am seeing some strange behavior in Hadoop. I am running a small test
cluster with a capacity of 18 mappers and 18 reducers. I fire a lot of jobs
simultaneously, and over time I have observed that Hadoop is not utilizing
all 18 reducer slots.

And now even if I run just one job (no other jobs running), it starts fewer
than 18 reducers. Initially it was starting all 18, but the number gradually
decreased. For example, it started only 13 reducers for a job that I just
submitted.

Further, one reducer is stuck in the pending state for a very long time.
While all other reducers finished, one reducer was stuck in the pending state
for at least 20-30 minutes.

The mappers seem to be doing fine. Any thoughts or suggestions on what could
be happening here?

Cluster configuration:
1) Master: also runs 4 mappers + 4 reducers
2) 2 slaves: each runs 7 mappers + 7 reducers

I run the Ganglia monitoring system, and I can tell you the system was not
overloaded at any time.

Thanks,
Tarandeep


Re: Hadoop not utilizing max reducer capacity + reducer stuck in pending state

2010-08-18 Thread Tarandeep Singh
OK, I missed one thing: I had turned on speculative execution.
That explains why fewer reducers are running (some tasks have 2 reducer
attempts running at once).

However, I still could not find out why a reducer was stuck in the pending
state for a long time. There were no other jobs running and all other
reducers had finished.
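
The slot arithmetic behind that explanation can be checked with a short sketch. It assumes each speculated task has exactly one extra attempt and that every reduce slot is busy. The node names and the count of 5 speculative attempts are hypothetical, chosen only so the numbers match the 18 slots and 13 reducers reported above.

```python
# Reduce slots per node, taken from the cluster description in this thread.
# The node names are hypothetical labels.
reduce_slots = {"master": 4, "slave1": 7, "slave2": 7}


def distinct_tasks_running(total_slots, speculative_attempts):
    """Distinct reduce tasks running when all slots are busy.

    Assumes one extra (speculative) attempt per speculated task, so each
    speculative attempt uses a slot without adding a new task.
    """
    if not 0 <= speculative_attempts <= total_slots // 2:
        raise ValueError("speculative attempts cannot exceed half the slots")
    return total_slots - speculative_attempts


total = sum(reduce_slots.values())
print(total)                             # 18
print(distinct_tasks_running(total, 5))  # 13
```

If the duplicate attempts are not wanted, Hadoop 0.20 has a per-job property, mapred.reduce.tasks.speculative.execution, that can be set to false. Whether that is the right trade-off depends on how uneven the reducers are.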



Re: Reducer stuck at pending state

2010-02-17 Thread Song Liu
Hi Todd, I'm using Hadoop 0.20.1, Apache distribution.
I didn't set the property you mentioned (mapred.child.java.opts), so I think
it should remain at its default (1G?).

The cluster I'm playing with physically has four master nodes and 96 slave
nodes. Hadoop uses one master node for the namenode and jobtracker, and
picks 12 nodes for its datanodes and tasktrackers.

Interestingly, I noticed the hardware specification is a little different
between the master and slave machines. So I moved the namenode and
jobtracker to one of the slaves. The problem seems solved. (My program runs
normally SO FAR.)

However, I cannot find the concrete hardware configuration for each node,
but I guess the differences are mainly in the CPUs or RAM.

These are copied from the cluster's specification manual:

Slaves:

each with two 2.6 GHz dual-core Opteron processors, 8 GB RAM, 16 GB swap
space and 50 GB of local scratch space

Masters:

each with four 2.6 GHz dual-core Opteron processors, 32 GB RAM, 64 GB swap
space, 64 GB of local scratch space

Can you see what the problem is?

Thanks a lot.
Regards
Song Liu




Reducer stuck at pending state

2010-02-16 Thread Song Liu
Hi all, I recently ran into a problem where a reducer sometimes hangs in the
pending state at 0% complete.

It seems all the mappers are completely done, and just when the reducer is
about to start, it gets stuck without any warnings or errors and stays in
the pending state.

I have a cluster with 12 nodes. This situation only appears when the data is
large (2 GB or more); smaller cases never hit this problem.

Has anyone met this issue before? I searched JIRA and found that someone
reported this issue before, but no solution was given. (
https://issues.apache.org/jira/browse/MAPREDUCE-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647230#action_12647230
)

The typical case of this issue is captured in the attachment.

Regards
Song Liu


Re: Reducer stuck at pending state

2010-02-16 Thread Song Liu
Sorry, seems no attachment is allowed, I paste it here:

Jobid     Priority  User    Name    Map %     Map    Maps       Reduce %  Reduce  Reduces    Job Scheduling
                                    Complete  Total  Completed  Complete  Total   Completed  Information
job_2...  NORMAL    sl9885  TF/IDF  100.00%   26     26         0.00%     1       0          NA
job_2...  NORMAL    sl9885  Rank    100.00%   22     22         0.00%     1       0          NA
job_2...  NORMAL    sl9885  TF/IDF  100.00%   20     20         0.00%     1       0          NA

The format is horrible, sorry for that, but it's the best I can do :(

BTW, I guess it should not be my program's problem, since I have tested it
on some other clusters before.
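
For readers who want to spot this pattern in their own job list, here is a minimal sketch that flags jobs whose maps are all done but whose reduces have not started. The rows are transcribed from the table above. The tuple layout and the function name are assumptions made for illustration, not anything the JobTracker provides.

```python
# Rows transcribed from the job table in this message:
# (name, map % complete, maps total, maps completed,
#  reduce % complete, reduces total, reduces completed)
jobs = [
    ("TF/IDF", 100.0, 26, 26, 0.0, 1, 0),
    ("Rank", 100.0, 22, 22, 0.0, 1, 0),
    ("TF/IDF", 100.0, 20, 20, 0.0, 1, 0),
]


def reduce_not_started(row):
    """True if every map finished but no reduce has made any progress."""
    _, _, maps_total, maps_done, reduce_pct, reduces_total, reduces_done = row
    maps_finished = maps_done == maps_total
    reduces_idle = reduces_total > 0 and reduces_done == 0 and reduce_pct == 0.0
    return maps_finished and reduces_idle


stuck = [row[0] for row in jobs if reduce_not_started(row)]
print(stuck)  # ['TF/IDF', 'Rank', 'TF/IDF']
```

All three jobs match, which fits the description: each has a single reducer that never left 0%.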

Regards
Song Liu




Re: Reducer stuck at pending state

2010-02-16 Thread Todd Lipcon
Hi Song,

What version are you running? How much memory have you allocated to
the reducers in mapred.child.java.opts?

-Todd
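
One reason the heap setting matters is that every task slot on a node can run its own child JVM. The sketch below is a rough estimate only: it reads the -Xmx value out of a mapred.child.java.opts string and multiplies it by the number of slots. The helper names and the slot counts in the example calls are assumptions, and real memory use also includes the TaskTracker and DataNode daemons and JVM overhead beyond the heap.

```python
import re


def max_heap_mb(java_opts):
    """Return the -Xmx value in MB from a JVM options string, or None."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    bytes_per_unit = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    return value * bytes_per_unit[unit] / 1024 ** 2


def worst_case_heap_mb(java_opts, map_slots, reduce_slots):
    """Heap needed if every slot on a node runs a child JVM at full heap."""
    heap = max_heap_mb(java_opts)
    if heap is None:
        raise ValueError("no -Xmx setting found in: %r" % java_opts)
    return heap * (map_slots + reduce_slots)


print(max_heap_mb("-Xmx200m"))             # 200.0
print(worst_case_heap_mb("-Xmx1g", 2, 2))  # 4096.0
```

As far as can be told, the mapred-default.xml shipped with 0.20 sets mapred.child.java.opts to -Xmx200m, so it is worth checking the actual value on the cluster rather than assuming a default.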

On Tue, Feb 16, 2010 at 4:01 PM, Song Liu lamfeeli...@gmail.com wrote:
 Sorry, seems no attachment is allowed, I paste it here:

 Jobid     Priority  User    Name    Map %     Map    Maps       Reduce %  Reduce  Reduces    Job Scheduling
                                     Complete  Total  Completed  Complete  Total   Completed  Information
 job_2...  NORMAL    sl9885  TF/IDF  100.00%   26     26         0.00%     1       0          NA
 job_2...  NORMAL    sl9885  Rank    100.00%   22     22         0.00%     1       0          NA
 job_2...  NORMAL    sl9885  TF/IDF  100.00%   20     20         0.00%     1       0          NA

 The format is horrible, sorry for that, but it's the best I can do :(

 BTW, I guess it should not be my program's problem, since I have tested it
 on some other clusters before.

 Regards
 Song Liu
