[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.

2016-02-27 Thread zhihai xu (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170481#comment-15170481 ]

zhihai xu commented on YARN-4728:
-

Yes, MAPREDUCE-6513 is possible, but YARN-1680 may be more likely, because node 
blacklisting can occur more easily in your environment than the MAPREDUCE-6513 
scenario, especially with mapreduce.job.reduce.slowstart.completedmaps=1. To 
tell whether it is MAPREDUCE-6513 or YARN-1680, you need to check the log to 
see whether the reduce tasks are preempted. If the reduce tasks are preempted 
and the map tasks still can't get resources, it is 
MAPREDUCE-6513/MAPREDUCE-6514; otherwise, it is YARN-1680. Even if YARN-1680, 
which would trigger the preemption, is fixed, MAPREDUCE-6513 can still happen.
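
For context, mapreduce.job.reduce.slowstart.completedmaps sets the fraction of 
maps that must complete before reducers are scheduled; the shipped default is 
0.05, and 1 means no reducer starts until every map has finished. A minimal 
per-job sketch of the setting discussed above:

    <property>
      <name>mapreduce.job.reduce.slowstart.completedmaps</name>
      <!-- fraction of maps that must complete before reducers launch; default 0.05 -->
      <value>1</value>
    </property>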

> MapReduce job doesn't make any progress for a very very long time after one 
> Node become unusable.
> -
>
> Key: YARN-4728
> URL: https://issues.apache.org/jira/browse/YARN-4728
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler, nodemanager, resourcemanager
> Affects Versions: 2.6.0
> Environment: hadoop 2.6.0, yarn
> Reporter: Silnov
> Priority: Critical
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I have some nodes running hadoop 2.6.0.
> The cluster's configuration is largely left at the defaults.
> I run jobs on the cluster every day, including some that process a lot of 
> data.
> Sometimes a job stays at the same progress value for a very, very long time, 
> so I have to kill the job manually and re-submit it to the cluster. This 
> worked before (the re-submitted job would run to completion), but something 
> went wrong today.
> After I re-submitted the same job 3 times, each run deadlocked: the progress 
> stopped changing for a long time, and each run stalled at a different value 
> (e.g. 33.01%, 45.8%, 73.21%).
> I checked the Hadoop web UI and found 98 map tasks pending while the running 
> reduce tasks had consumed all of the available memory. I stopped YARN, added 
> the configuration below to yarn-site.xml, and restarted YARN:
> yarn.app.mapreduce.am.job.reduce.rampup.limit = 0.1
> yarn.app.mapreduce.am.job.reduce.preemption.limit = 1.0
> (The intent was for YARN to preempt the reduce tasks' resources to run the 
> pending map tasks.)
> After restarting YARN, I submitted the job with the property 
> mapreduce.job.reduce.slowstart.completedmaps=1,
> but the same thing happened again: the job stayed at the same progress value 
> for a very, very long time.
> I checked the web UI again and found that the pending map tasks were new 
> attempts with the note: "TaskAttempt killed because it ran on unusable node 
> node02:21349".
> Then I checked the ResourceManager's log and found these messages:
> **Deactivating Node node02:21349 as it is now LOST.
> **node02:21349 Node Transitioned from RUNNING to LOST.
> I think this may have happened because the network across my cluster is poor, 
> so the RM did not receive the NM's heartbeats in time.
> But I wonder why the YARN framework can't preempt the running reduce tasks' 
> resources to run the pending map tasks. (This is what leaves the job at the 
> same progress value for a very, very long time :( )
> Can anyone help?
> Thank you very much!
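
Written out as full entries, the two AM-side settings above would look like the 
sketch below. Note these are MapReduce AM properties that are normally read 
from the job configuration (mapred-site.xml or per-job overrides); the values 
are the reporter's, and the comments paraphrase their intent:

    <property>
      <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
      <!-- cap on the fraction of resources reducers may occupy before all maps finish -->
      <value>0.1</value>
    </property>
    <property>
      <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
      <!-- cap on the fraction of reducers that may be preempted to run starved maps -->
      <value>1.0</value>
    </property>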





[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.

2016-02-26 Thread Silnov (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170406#comment-15170406 ]

Silnov commented on YARN-4728:
--

Varun Saxena, thanks for your response!
I have checked MAPREDUCE-6513. The scenario is similar to what you described. 
I'll learn from it :)



[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.

2016-02-26 Thread Silnov (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170403#comment-15170403 ]

Silnov commented on YARN-4728:
--

zhihai xu, thanks for your response!
I will try making the changes you suggested!



[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.

2016-02-23 Thread Varun Saxena (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160343#comment-15160343 ]

Varun Saxena commented on YARN-4728:


[~Silnov], in addition to the above, can you check your AM logs and see whether 
the scenario is similar to MAPREDUCE-6513?
I suspect it is the same.



[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.

2016-02-23 Thread zhihai xu (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160305#comment-15160305 ]

zhihai xu commented on YARN-4728:
-

Thanks for reporting this issue [~Silnov]!
It looks like this issue is caused by long timeouts at two levels. It is 
similar to YARN-3944, YARN-4414, YARN-3238 and YARN-3554. You may be able to 
work around it by changing the following configuration values: 
"ipc.client.connect.max.retries.on.timeouts" (default is 45), 
"ipc.client.connect.timeout" (default is 20,000ms) and 
"yarn.client.nodemanager-connect.max-wait-ms" (default is 900,000ms).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)