Silnov created YARN-4728:
----------------------------

             Summary: MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
                 Key: YARN-4728
                 URL: https://issues.apache.org/jira/browse/YARN-4728
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacityscheduler, nodemanager, resourcemanager
    Affects Versions: 2.6.0
         Environment: hadoop 2.6.0 yarn
            Reporter: Silnov
            Priority: Critical

I have some nodes running Hadoop 2.6.0. The cluster's configuration is largely left at the defaults. I run jobs on the cluster every day, some of which process a lot of data. Sometimes a job stays at the same progress for a very long time, so I have to kill it manually and resubmit it. Until now that worked (the resubmitted job ran to the end), but today something went wrong.

After I resubmitted the same job 3 times, every run got stuck: the progress did not change for a long time, and each run stopped at a different value (e.g. 33.01%, 45.8%, 73.21%).

I checked the Hadoop web UI and found 98 map tasks suspended while the running reduce tasks had consumed all the available memory.

I stopped YARN, added the configuration below to yarn-site.xml, and restarted YARN:

<property>yarn.app.mapreduce.am.job.reduce.rampup.limit</property>
<value>0.1</value>
<property>yarn.app.mapreduce.am.job.reduce.preemption.limit</property>
<value>1.0</value>

(The intent was for YARN to preempt the reduce tasks' resources so that the suspended map tasks could run.)

After restarting YARN, I submitted the job with the property mapreduce.job.reduce.slowstart.completedmaps=1, but the same thing happened again: the job stayed at the same progress value for a very long time.

I checked the web UI again and found that the suspended map tasks were new attempts, and that the previous attempts carried this note: "TaskAttempt killed because it ran on unusable node node02:21349".

Then I checked the ResourceManager's log and found these messages:

******Deactivating Node node02:21349 as it is now LOST.
******node02:21349 Node Transitioned from RUNNING to LOST.

I think this happens because the network across my cluster is unreliable, so the RM does not receive the NM's heartbeat in time.

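For reference, here is a sketch of how the two settings above would look in the usual Hadoop name/value form. It assumes the snippet above is shorthand for entries like these. The <configuration> wrapper is the standard root element of a Hadoop *-site.xml file, and the values are the ones tried in this report, not recommendations.

```xml
<configuration>
  <!-- Value tried in this report (assumed to be written as a name/value pair) -->
  <property>
    <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
    <value>0.1</value>
  </property>
  <!-- Value tried in this report (assumed to be written as a name/value pair) -->
  <property>
    <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
    <value>1.0</value>
  </property>
</configuration>
```
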
But I wonder why the YARN framework can't preempt the running reduce tasks' resources to run the suspended map tasks. This is what keeps the job at the same progress value for a very long time. :(

Can anyone help? Thank you very much!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)