Hadoop not utilizing max reducer capacity + reducer stuck in pending state
Hi,

I am seeing some strange behavior in Hadoop. I am running a small test cluster with a capacity of 18 mappers and 18 reducers. I fire a lot of jobs simultaneously, and over time I have observed that Hadoop is not utilizing all 18 reducer slots.

Now, even if I run just one job (with no other jobs running), it starts fewer than 18 reducers. Initially it was starting all 18, but the number gradually decreased. For example, it started only 13 reducers for a job that I just submitted.

Further, one reducer is stuck in pending state for a very long time. While all the other reducers finished, one reducer stayed in pending state for at least 20-30 minutes.

The mappers seem to be doing fine.

Any thoughts/suggestions on what could be happening here?

Cluster conf:
1) Master: also runs 4 mappers + 4 reducers
2) 2 slaves: each runs 7 mappers + 7 reducers

I run the Ganglia monitoring system and I can tell you the system was not overloaded at any time.

Thanks,
Tarandeep
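For reference, the 18-slot figure follows from the per-node limits listed in the message. The sketch below only spells out that arithmetic. It assumes the per-node numbers are the limits configured on each TaskTracker (in Hadoop 0.20 these are normally set with `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`); the node names and the helper `cluster_capacity` are hypothetical.

```python
# Per-node slot limits as described in the message.
# Node names are placeholders, not the real hostnames.
nodes = {
    "master": {"map": 4, "reduce": 4},
    "slave1": {"map": 7, "reduce": 7},
    "slave2": {"map": 7, "reduce": 7},
}


def cluster_capacity(nodes, kind):
    """Sum the configured slots of one kind ("map" or "reduce") over all nodes."""
    return sum(limits[kind] for limits in nodes.values())


map_capacity = cluster_capacity(nodes, "map")
reduce_capacity = cluster_capacity(nodes, "reduce")
print("map slots:", map_capacity, "reduce slots:", reduce_capacity)
```

If the JobTracker web UI reports a smaller reduce capacity than this sum, a TaskTracker has probably dropped out or is running with different limits, which is a separate issue from slots that exist but sit idle.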
Re: Hadoop not utilizing max reducer capacity + reducer stuck in pending state
OK, I missed one thing: I had turned on speculative execution. This explains why fewer reducers are running (there are 2 reducers running for some tasks).

However, I still could not find out why a reducer was stuck in pending state for a long time. There were no other jobs running and all the other reducers had finished.

On Wed, Aug 18, 2010 at 2:44 PM, Tarandeep Singh tarand...@gmail.com wrote:
> Hi, I am seeing some strange behavior in Hadoop - I am running a small test cluster with a capacity of 18 mappers and 18 reducers. [...]
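The speculative-execution explanation can be checked by counting task attempts rather than tasks, since every running attempt occupies one slot. The sketch below is a minimal illustration of that accounting. The attempt IDs are made up (hypothetical job timestamp and number) but follow the usual `attempt_<jobtracker-start>_<job>_<m|r>_<task>_<attempt>` pattern, and the scenario of 13 reduce tasks plus 5 speculative duplicates is an assumption chosen to match the numbers in the thread, not something observed on the cluster.

```python
from collections import Counter


def summarize_attempts(attempt_ids):
    """Group running attempts by task; each attempt holds one slot."""
    # Dropping the trailing "_<attempt>" suffix leaves the per-task part of the ID.
    per_task = Counter(a.rsplit("_", 1)[0] for a in attempt_ids)
    return {
        "slots_in_use": len(attempt_ids),
        "distinct_tasks": len(per_task),
        "tasks_with_extra_attempts": sorted(t for t, n in per_task.items() if n > 1),
    }


# Hypothetical snapshot: 13 reduce tasks, 5 of which also have a speculative attempt.
prefix = "attempt_201008181444_0001_r_"
attempts = [f"{prefix}{i:06d}_0" for i in range(13)]
attempts += [f"{prefix}{i:06d}_1" for i in range(5)]

summary = summarize_attempts(attempts)
print(summary["slots_in_use"], "slots in use for",
      summary["distinct_tasks"], "distinct reduce tasks")
```

In this scenario all 18 slots are busy even though only 13 reducers show up as distinct tasks. It does not explain the single reducer that stays pending after everything else has finished.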
Re: Reducer stuck at pending state
Hi Todd,

I'm using Hadoop 0.20.1, Apache distribution. I didn't set the property you mentioned, so I think it should still be at its default (1G?).

The cluster I'm playing with physically has four master nodes and 96 slave nodes. Hadoop uses one master node for the NameNode and JobTracker, and picks 12 nodes for its DataNodes and TaskTrackers.

Interestingly, I noticed the hardware specification is a little different between master and slave machines. So I moved the NameNode and JobTracker to one of the slaves. The problem seems solved. (My program runs normally SO FAR.)

However, I cannot find the concrete hardware configuration for each node, but I guess the differences are mainly in the CPUs or RAM. These are copied from the cluster's specification manual:

Slaves: each with two 2.6 GHz dual-core Opteron processors, 8 GB RAM, 16 GB swap space and 50 GB of local scratch space
Masters: each with four 2.6 GHz dual-core Opteron processors, 32 GB RAM, 64 GB swap space and 64 GB of local scratch space

Can you see what the problem is?

Thanks a lot.

Regards
Song Liu

On Wed, Feb 17, 2010 at 4:18 AM, Todd Lipcon t...@cloudera.com wrote:
> Hi Song,
> What version are you running? How much memory have you allocated to the reducers in mapred.child.java.opts?
> -Todd
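Todd's question about `mapred.child.java.opts` comes down to a memory budget per TaskTracker. The sketch below is an illustration, not a diagnosis of this cluster: it reads the `-Xmx` value from an options string and multiplies it by the number of task slots on a node. The slot counts (2 map + 2 reduce) and the two option strings are assumptions for illustration. As far as I know the stock 0.20 `mapred-default.xml` ships `-Xmx200m` rather than 1G, but that is worth checking on your own install. The helper names are hypothetical.

```python
import re

_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def max_heap_bytes(java_opts):
    """Return the first -Xmx value in java_opts as bytes, or None if absent."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if match is None:
        return None
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit.lower()]


def worst_case_heap_bytes(java_opts, map_slots, reduce_slots):
    """Heap used if every slot on a node runs a child JVM at its maximum heap."""
    heap = max_heap_bytes(java_opts)
    if heap is None:
        raise ValueError("no -Xmx setting found in %r" % java_opts)
    return heap * (map_slots + reduce_slots)


slave_ram = 8 * 1024 ** 3  # 8 GB per slave, from the specification quoted above

for opts in ("-Xmx200m", "-Xmx1024m"):  # assumed settings, for comparison only
    total = worst_case_heap_bytes(opts, map_slots=2, reduce_slots=2)
    print(f"{opts}: up to {total / 1024 ** 2:.0f} MiB of heap "
          f"against {slave_ram / 1024 ** 2:.0f} MiB of RAM")
```

Heap is only part of a JVM's footprint, and the DataNode and TaskTracker daemons need memory too, so this gives a lower bound on what the node must provide.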
Reducer stuck at pending state
Hi all,

I have recently met a problem where reducers sometimes hang in the pending state at 0% complete. All the mappers appear to be completely done, but just as the reducer is about to start it gets stuck, without any warnings or errors, and stays in the pending state.

I have a cluster with 12 nodes. This situation only appears when the scale of the data is large (2 GB or more); smaller cases never hit this problem.

Has anyone met this issue before? I searched JIRA and found that someone reported this issue before, but no solution was given.
( https://issues.apache.org/jira/browse/MAPREDUCE-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647230#action_12647230 )

The typical case of this issue is captured in the attachment.

Regards
Song Liu
Re: Reducer stuck at pending state
Sorry, it seems no attachment is allowed, so I paste it here:

Jobid     Priority  User    Name    Map % Complete  Map Total  Maps Completed  Reduce % Complete  Reduce Total  Reduces Completed  Job Scheduling Information
job_2...  NORMAL    sl9885  TF/IDF  100.00%         26         26              0.00%              1             0                  NA
job_2...  NORMAL    sl9885  Rank    100.00%         22         22              0.00%              1             0                  NA
job_2...  NORMAL    sl9885  TF/IDF  100.00%         20         20              0.00%              1             0                  NA

The format is horrible, sorry for that, but it's the best I can do :(

BTW, I guess it should not be my program's problem, since I have tested it on some other clusters before.

Regards
Song Liu

On Tue, Feb 16, 2010 at 11:51 PM, Song Liu lamfeeli...@gmail.com wrote:
> Hi all, I recently have met a problem that sometimes, reducer hang up at pending state, with 0% complete. [...]
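The symptom in the pasted job table (all maps complete, reduce at 0% with no reducer completed) is easy to test for mechanically. The sketch below is one way to do it, assuming each row splits on whitespace into the eleven columns named in the table header and that job names contain no spaces. The field names and the helper functions are hypothetical, and the truncated job IDs are kept exactly as pasted.

```python
ROWS = """\
job_2... NORMAL sl9885 TF/IDF 100.00% 26 26 0.00% 1 0 NA
job_2... NORMAL sl9885 Rank 100.00% 22 22 0.00% 1 0 NA
job_2... NORMAL sl9885 TF/IDF 100.00% 20 20 0.00% 1 0 NA
"""

FIELDS = (
    "jobid", "priority", "user", "name",
    "map_pct", "map_total", "maps_done",
    "reduce_pct", "reduce_total", "reduces_done",
    "scheduling_info",
)


def parse_row(line):
    """Split one whitespace-separated job row into a dict with typed values."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(parts)))
    row = dict(zip(FIELDS, parts))
    for key in ("map_pct", "reduce_pct"):
        row[key] = float(row[key].rstrip("%"))
    for key in ("map_total", "maps_done", "reduce_total", "reduces_done"):
        row[key] = int(row[key])
    return row


def reducers_not_started(row):
    """True when every map has finished but the reduce phase shows no progress."""
    return (
        row["maps_done"] == row["map_total"]
        and row["reduce_pct"] == 0.0
        and row["reduces_done"] < row["reduce_total"]
    )


rows = [parse_row(line) for line in ROWS.splitlines()]
stuck = [row["name"] for row in rows if reducers_not_started(row)]
print("jobs with maps done but reduce at 0%:", stuck)
```

All three pasted jobs match the check, and each has a single reducer, so the job as a whole cannot finish until that one task is scheduled.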
Re: Reducer stuck at pending state
Hi Song,

What version are you running? How much memory have you allocated to the reducers in mapred.child.java.opts?

-Todd

On Tue, Feb 16, 2010 at 4:01 PM, Song Liu lamfeeli...@gmail.com wrote:
> Sorry, seems no attachment is allowed, I paste it here: [...]