[jira] [Updated] (YARN-10430) Log improvements in NodeStatusUpdaterImpl

2020-09-14 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10430:
---
Component/s: nodemanager

> Log improvements in NodeStatusUpdaterImpl
> -
>
> Key: YARN-10430
> URL: https://issues.apache.org/jira/browse/YARN-10430
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.3.1, 3.4.1
>
> Attachments: YARN-10430.001.patch
>
>
> I think in the places below, the log should be printed only if the list size is not zero.
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("The cache log aggregation status size:"
>       + logAggregationReports.size());
> }
> {code}
> {code:java}
> LOG.info("Sending out " + containerStatuses.size()
>   + " NM container statuses: " + containerStatuses);
> {code}
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Sending out " + containerStatuses.size()
>   + " container statuses: " + containerStatuses);
> }
> {code}
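> A minimal sketch of how the proposed guard could look, based only on the
> snippets above (this is not the attached YARN-10430.001.patch; the field
> names are taken from the quoted code):
> {code:java}
> // Sketch: only log when the corresponding list is non-empty.
> if (LOG.isDebugEnabled() && !logAggregationReports.isEmpty()) {
>   LOG.debug("The cache log aggregation status size:"
>       + logAggregationReports.size());
> }
>
> if (!containerStatuses.isEmpty()) {
>   LOG.info("Sending out " + containerStatuses.size()
>       + " NM container statuses: " + containerStatuses);
> }
> {code}
> The same non-empty check would apply to the debug-level container status
> snippet as well.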






[jira] [Updated] (YARN-10430) Log improvements in NodeStatusUpdaterImpl

2020-09-14 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10430:
---
Fix Version/s: 3.4.1
   3.3.1

> Log improvements in NodeStatusUpdaterImpl
> -
>
> Key: YARN-10430
> URL: https://issues.apache.org/jira/browse/YARN-10430
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.3.1, 3.4.1
>
> Attachments: YARN-10430.001.patch
>
>
> I think in the places below, the log should be printed only if the list size is not zero.
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("The cache log aggregation status size:"
>       + logAggregationReports.size());
> }
> {code}
> {code:java}
> LOG.info("Sending out " + containerStatuses.size()
>   + " NM container statuses: " + containerStatuses);
> {code}
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Sending out " + containerStatuses.size()
>   + " container statuses: " + containerStatuses);
> }
> {code}






[jira] [Commented] (YARN-10430) Log improvements in NodeStatusUpdaterImpl

2020-09-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195742#comment-17195742
 ] 

Jim Brennan commented on YARN-10430:


Thanks [~BilwaST]!  I have committed this to trunk and branch-3.3.  It did not 
apply cleanly to branch-3.2 and earlier, so please put up a patch for those if 
desired.

 

> Log improvements in NodeStatusUpdaterImpl
> -
>
> Key: YARN-10430
> URL: https://issues.apache.org/jira/browse/YARN-10430
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10430.001.patch
>
>
> I think in the places below, the log should be printed only if the list size is not zero.
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("The cache log aggregation status size:"
>       + logAggregationReports.size());
> }
> {code}
> {code:java}
> LOG.info("Sending out " + containerStatuses.size()
>   + " NM container statuses: " + containerStatuses);
> {code}
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Sending out " + containerStatuses.size()
>   + " container statuses: " + containerStatuses);
> }
> {code}






[jira] [Commented] (YARN-10430) Log improvements in NodeStatusUpdaterImpl

2020-09-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195727#comment-17195727
 ] 

Jim Brennan commented on YARN-10430:


Thanks for the patch [~BilwaST]!  +1, this looks good to me.

 

> Log improvements in NodeStatusUpdaterImpl
> -
>
> Key: YARN-10430
> URL: https://issues.apache.org/jira/browse/YARN-10430
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10430.001.patch
>
>
> I think in the places below, the log should be printed only if the list size is not zero.
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("The cache log aggregation status size:"
>       + logAggregationReports.size());
> }
> {code}
> {code:java}
> LOG.info("Sending out " + containerStatuses.size()
>   + " NM container statuses: " + containerStatuses);
> {code}
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Sending out " + containerStatuses.size()
>   + " container statuses: " + containerStatuses);
> }
> {code}






[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195694#comment-17195694
 ] 

Jim Brennan commented on YARN-10393:


[~wzzdreamer], [~adam.antal], what do you think of the draft patches?  It would 
be good to come to a resolution on this.

 

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This is a bug we saw multiple times on Hadoop 2.6.2; the following analysis 
> is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. We 
> have not seen it after 2.9 in our environment, but that is due to the RPC 
> retry policy change and other changes; unless I missed something, it is 
> still possible with the current code.
> *High-level description:*
>  We have seen a starving-mapper issue several times: the MR job gets stuck 
> in a livelock and cannot make any progress. The queue is full, so the pending 
> mapper cannot get any resources to continue, and the application master fails 
> to preempt the reducer, which leaves the job stuck. The application master 
> does not preempt the reducer because of a leaked container among the assigned 
> mappers: the node manager failed to report the completed container to the 
> resource manager.
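> The self-contained toy below (hypothetical names and simplified logic, not 
> the Hadoop source) illustrates the general hazard behind this class of leak: 
> if pending completed-container statuses are cleared before the RM's response 
> confirms delivery, a lost response (as in the detailed steps below) means the 
> completed status is never reported again.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> class LostAckLeakDemo {
>   public static void main(String[] args) {
>     List<String> pendingCompleted = new ArrayList<>();
>     pendingCompleted.add("container_1501226097332_249991_01_000199");
>
>     boolean ackReceived = false;   // the heartbeat response was lost
>
>     // Buggy ordering: clear the pending list as soon as the request is
>     // sent, instead of waiting for the response to confirm delivery.
>     List<String> sent = new ArrayList<>(pendingCompleted);
>     pendingCompleted.clear();
>
>     if (!ackReceived) {
>       // A correct implementation would put `sent` back so that the next
>       // heartbeat retries it; without that, the status is gone for good.
>       System.out.println("Leaked completed-container statuses: " + sent);
>     }
>   }
> }
> {code}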
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat at 2017-08-08 
> 16:07:04,238. The heartbeat request was actually handled by the resource 
> manager, but the node manager failed to receive the response. Let's assume 
> heartBeatResponseId=$hid in the node manager. With our current configuration, 
> the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at 

[jira] [Commented] (YARN-9604) RM Shutdown with FATAL Exception

2020-09-14 Thread chan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195240#comment-17195240
 ] 

chan commented on YARN-9604:


I think it may be caused by the AM releasing the container, since the AM 
container-release path does not take the lock.

> RM Shutdown with FATAL Exception
> 
>
> Key: YARN-9604
> URL: https://issues.apache.org/jira/browse/YARN-9604
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Amithsha
>Priority: Critical
>
> We faced this FATAL exception earlier, and it was resolved by adding the 
> following properties.
>   <property>
>     <name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.node-locality-delay</name>
>     <value>0</value>
>   </property>
> https://issues.apache.org/jira/browse/YARN-8462 (patch and description)
>  
> Recently we have been facing the same FATAL exception with a different stack 
> trace.
>  
>  
>  
> 2019-06-06 08:30:38,424 FATAL event.EventDispatcher (?:?(?)) - Error in 
> handling event type NODE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:814)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:876)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.allocateFromReservedContainer(LeafQueue.java:1002)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1026)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1274)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1430)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1205)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1067)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1472)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:151)
>  at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-06-06 08:30:38,424 INFO event.EventDispatcher (?:?(?)) - Exiting, bbye..






[jira] [Updated] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-09-14 Thread chan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan updated YARN-10395:

Fix Version/s: (was: 2.9.2)
 Target Version/s:   (was: 2.9.2)
Affects Version/s: (was: 2.9.2)

> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: chan
>Priority: Major
> Attachments: Yarn-10395-001.patch
>
>
> Currently, if an app has reserved a node but that node is added to the app's 
> blacklist, then when the node sends a heartbeat to the resourcemanager the 
> reserved container fails to allocate, and the node cannot allocate any other 
> container even though it has enough memory and vcores. So I think we can 
> release the reserved container when the reserved node is in the blacklist of 
> the app.
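> A self-contained toy sketch of the proposed behaviour (illustrative names 
> only, not the real CapacityScheduler API): when the node holding the 
> reservation is on the application's blacklist, release the reservation so 
> the node can serve other containers.
> {code:java}
> import java.util.Set;
>
> class ReservationBlacklistCheck {
>   static final class Node {
>     final String name;
>     String reservedByApp;          // null when nothing is reserved
>     Node(String name, String reservedByApp) {
>       this.name = name;
>       this.reservedByApp = reservedByApp;
>     }
>   }
>
>   /** Release the node's reservation if the reserving app blacklisted it. */
>   static boolean releaseIfBlacklisted(Node node, Set<String> appBlacklist) {
>     if (node.reservedByApp != null && appBlacklist.contains(node.name)) {
>       node.reservedByApp = null;   // free the node for other containers
>       return true;
>     }
>     return false;
>   }
>
>   public static void main(String[] args) {
>     Node n = new Node("host-1", "application_001");
>     System.out.println(releaseIfBlacklisted(n, Set.of("host-1")));  // true
>   }
> }
> {code}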
>  
>  






[jira] [Commented] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource

2020-09-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195217#comment-17195217
 ] 

Xiaoqiao He commented on YARN-4575:
---

Thanks [~bibinchundatt], [~epayne], [~ebadger] for your work. It seems the 
assignee was never set for this issue, so I have just assigned it to 
[~bibinchundatt], the original author, in preparation for the 3.2.2 release. 
Please let me know if I am missing something. Thanks again.

> ApplicationResourceUsageReport should return ALL  reserved resource
> ---
>
> Key: YARN-4575
> URL: https://issues.apache.org/jira/browse/YARN-4575
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin Chundatt
>Assignee: Bibin Chundatt
>Priority: Major
>  Labels: oct16-easy
> Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
> Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, 
> YARN-4575.003.patch, YARN-4575.004.patch, YARN-4575.005.patch, 
> YARN-4575.branch-3.1..005.patch
>
>
> The reserved resource report in ApplicationResourceUsageReport covers only 
> the default partition; it should cover all partitions.
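> A small illustration of the intent (toy types, not the YARN Resource or 
> partition API): the report should aggregate reserved resources over every 
> partition, not just the default one.
> {code:java}
> import java.util.Map;
>
> class ReservedAcrossPartitions {
>   public static void main(String[] args) {
>     // Reserved memory (MB) per node partition; "" is the default partition.
>     Map<String, Long> reservedMB = Map.of("", 2048L, "labelX", 4096L);
>
>     long defaultOnly = reservedMB.getOrDefault("", 0L);      // 2048
>     long allPartitions = reservedMB.values().stream()
>         .mapToLong(Long::longValue).sum();                   // 6144
>
>     System.out.println(defaultOnly + " MB vs " + allPartitions + " MB");
>   }
> }
> {code}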






[jira] [Assigned] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource

2020-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned YARN-4575:
-

Assignee: Bibin Chundatt

> ApplicationResourceUsageReport should return ALL  reserved resource
> ---
>
> Key: YARN-4575
> URL: https://issues.apache.org/jira/browse/YARN-4575
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin Chundatt
>Assignee: Bibin Chundatt
>Priority: Major
>  Labels: oct16-easy
> Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
> Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, 
> YARN-4575.003.patch, YARN-4575.004.patch, YARN-4575.005.patch, 
> YARN-4575.branch-3.1..005.patch
>
>
> The reserved resource report in ApplicationResourceUsageReport covers only 
> the default partition; it should cover all partitions.






[jira] [Assigned] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned YARN-10332:
--

Assignee: yehuanhuan

Added [~yehuanhuan] to the contributors list and assigned this issue to him in 
preparation for the 3.2.2 release.

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Assignee: yehuanhuan
>Priority: Minor
> Fix For: 3.2.2, 3.4.0, 3.3.1
>
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.
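> A toy sketch of what "repeatedly registered" means here (illustrative only, 
> not Hadoop's StateMachineFactory): the same (state, event) transition is 
> added twice, and the second registration is redundant.
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> class TransitionTable {
>   private final Map<String, String> table = new HashMap<>();
>
>   void addTransition(String state, String event, String target) {
>     String previous = table.put(state + "/" + event, target);
>     if (previous != null) {
>       System.out.println("Duplicate registration: " + state + "/" + event);
>     }
>   }
>
>   public static void main(String[] args) {
>     TransitionTable t = new TransitionTable();
>     t.addTransition("DECOMMISSIONING", "RESOURCE_UPDATE", "DECOMMISSIONING");
>     // Redundant second registration of the same (state, event) pair, which
>     // is the kind of duplication this issue removes.
>     t.addTransition("DECOMMISSIONING", "RESOURCE_UPDATE", "DECOMMISSIONING");
>   }
> }
> {code}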


