[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node becomes unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170481#comment-15170481 ]

zhihai xu commented on YARN-4728:
---------------------------------

Yes, MAPREDUCE-6513 is possible, but YARN-1680 may be more likely, because blacklisted nodes arise more easily in your environment than the MAPREDUCE-6513 scenario, especially with mapreduce.job.reduce.slowstart.completedmaps=1. To tell whether it is MAPREDUCE-6513 or YARN-1680, check the log to see whether the reduce task is preempted. If the reduce task is preempted and the map task still can't get resources, it is MAPREDUCE-6513/MAPREDUCE-6514; otherwise, it is YARN-1680. Note that even if YARN-1680, which triggers the preemption, is fixed, MAPREDUCE-6513 can still happen.

> MapReduce job doesn't make any progress for a very very long time after one
> Node becomes unusable.
> ----------------------------------------------------------------------------
>
>                 Key: YARN-4728
>                 URL: https://issues.apache.org/jira/browse/YARN-4728
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, nodemanager, resourcemanager
>    Affects Versions: 2.6.0
>         Environment: hadoop 2.6.0, yarn
>            Reporter: Silnov
>            Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have some nodes running Hadoop 2.6.0; the cluster's configuration remains
> largely at the defaults. I run jobs on the cluster every day, including some
> that process a lot of data.
> Sometimes a job stays at the same progress value for a very, very long time,
> so I have to kill it manually and re-submit it to the cluster. This used to
> work (the re-submitted job would run to completion), but something went
> wrong today. After I re-submitted the same job 3 times, each run deadlocked:
> the progress stopped changing for a long time, at a different value each
> time (e.g. 33.01%, 45.8%, 73.21%).
> Checking the Hadoop web UI, I found 98 map tasks suspended while the running
> reduce tasks had consumed all the available memory. I stopped YARN, added
> the configuration below to yarn-site.xml, and restarted YARN (wanting YARN
> to preempt the reduce tasks' resources to run the suspended map tasks):
>   yarn.app.mapreduce.am.job.reduce.rampup.limit = 0.1
>   yarn.app.mapreduce.am.job.reduce.preemption.limit = 1.0
> After restarting YARN, I submitted the job with the property
> mapreduce.job.reduce.slowstart.completedmaps=1, but the same thing happened
> again: the job stayed at the same progress value for a very, very long time.
> I checked the web UI again and found that the suspended map tasks had been
> re-created with the note: "TaskAttempt killed because it ran on unusable
> node node02:21349".
> Then I checked the ResourceManager's log and found these messages:
>   Deactivating Node node02:21349 as it is now LOST.
>   node02:21349 Node Transitioned from RUNNING to LOST.
> I think this happened because the network across my cluster is not good,
> which caused the RM not to receive the NM's heartbeat in time. But I wonder
> why the YARN framework can't preempt the running reduce tasks' resources to
> run the suspended map tasks? (This leaves the job at the same progress value
> for a very, very long time.)
> Can anyone help? Thank you very much!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
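For readers reproducing the reporter's setup, the settings described in the report correspond to entries like the sketch below. The property names and values are taken from the report itself; the placement is an assumption worth checking, since these are MapReduce ApplicationMaster properties and are normally read from the job configuration (mapred-site.xml or per-job overrides) rather than yarn-site.xml, where the reporter put them.

  <configuration>
    <!-- From the report: cap reduce ramp-up at 10% of resources and
         allow up to 100% of running reducers to be preempted so that
         pending maps can run. -->
    <property>
      <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
      <value>0.1</value>
    </property>
    <property>
      <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
      <value>1.0</value>
    </property>
    <!-- Also from the report, submitted per-job: do not schedule any
         reducers until all maps have completed. -->
    <property>
      <name>mapreduce.job.reduce.slowstart.completedmaps</name>
      <value>1</value>
    </property>
  </configuration>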
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node becomes unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170406#comment-15170406 ]

Silnov commented on YARN-4728:
------------------------------

Varun Saxena, thanks for your response! I have checked MAPREDUCE-6513, and the scenario is similar to what you described. I'll learn what I can from it. :)
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node becomes unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170403#comment-15170403 ]

Silnov commented on YARN-4728:
------------------------------

zhihai xu, thanks for your response! I will try making some changes following your advice!
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node becomes unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160343#comment-15160343 ]

Varun Saxena commented on YARN-4728:
------------------------------------

[~Silnov], in addition to the above, can you check your AM logs and see whether the scenario is similar to MAPREDUCE-6513? I suspect it's the same.
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node becomes unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160305#comment-15160305 ]

zhihai xu commented on YARN-4728:
---------------------------------

Thanks for reporting this issue, [~Silnov]! It looks like it is caused by long timeouts at two levels, and it is similar to YARN-3944, YARN-4414, YARN-3238 and YARN-3554. You may be able to work around it by tuning these configuration values: "ipc.client.connect.max.retries.on.timeouts" (default: 45), "ipc.client.connect.timeout" (default: 20,000ms), and "yarn.client.nodemanager-connect.max-wait-ms" (default: 900,000ms).
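A minimal sketch of the workaround described above. The property names are standard Hadoop/YARN settings; the values shown are illustrative assumptions only (much tighter than the defaults), and the right numbers depend on how unreliable your network is. The ipc.* properties belong in core-site.xml, the nodemanager-connect property in yarn-site.xml.

  <!-- core-site.xml: bound how long an IPC client keeps retrying a dead
       node. Defaults are 45 retries and a 20000 ms connect timeout; the
       example values below cut the worst case to under a minute. -->
  <property>
    <name>ipc.client.connect.max.retries.on.timeouts</name>
    <value>3</value>
  </property>
  <property>
    <name>ipc.client.connect.timeout</name>
    <value>5000</value>
  </property>

  <!-- yarn-site.xml: bound how long clients (including the MR AM) keep
       trying to reach a NodeManager. Default is 900000 ms (15 minutes). -->
  <property>
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <value>60000</value>
  </property>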