[jira] [Updated] (YARN-10430) Log improvements in NodeStatusUpdaterImpl
[ https://issues.apache.org/jira/browse/YARN-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Brennan updated YARN-10430:
-------------------------------
    Component/s: nodemanager

> Log improvements in NodeStatusUpdaterImpl
> -----------------------------------------
>
>                 Key: YARN-10430
>                 URL: https://issues.apache.org/jira/browse/YARN-10430
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Bilwa S T
>            Assignee: Bilwa S T
>            Priority: Minor
>             Fix For: 3.3.1, 3.4.1
>
>         Attachments: YARN-10430.001.patch
>
>
> I think the log in the places below should be printed only if the list size is non-zero.
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("The cache log aggregation status size:"
>       + logAggregationReports.size());
> }
> {code}
> {code:java}
> LOG.info("Sending out " + containerStatuses.size()
>     + " NM container statuses: " + containerStatuses);
> {code}
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Sending out " + containerStatuses.size()
>       + " container statuses: " + containerStatuses);
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
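The guard suggested in the description can be sketched as follows. This is an illustrative standalone class, not the committed patch: `GuardedLogging`, `describe`, and the plain `List<String>` are stand-ins for the actual NodeStatusUpdaterImpl fields and SLF4J logger calls.

```java
import java.util.List;

public class GuardedLogging {
    // Stand-in for the NM's report list; in NodeStatusUpdaterImpl this would
    // be logAggregationReports or containerStatuses.
    static String describe(List<String> reports) {
        // Only build (and emit) the message when there is something to report,
        // avoiding a log line for every empty heartbeat.
        if (reports.isEmpty()) {
            return null; // nothing to log
        }
        return "The cache log aggregation status size:" + reports.size();
    }
}
```

In the real code the `null` branch would simply skip the `LOG.debug`/`LOG.info` call; returning the message here just makes the guard easy to exercise.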
[jira] [Updated] (YARN-10430) Log improvements in NodeStatusUpdaterImpl
[ https://issues.apache.org/jira/browse/YARN-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Brennan updated YARN-10430:
-------------------------------
    Fix Version/s: 3.4.1
                   3.3.1
[jira] [Commented] (YARN-10430) Log improvements in NodeStatusUpdaterImpl
[ https://issues.apache.org/jira/browse/YARN-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195742#comment-17195742 ]

Jim Brennan commented on YARN-10430:
------------------------------------

Thanks [~BilwaST]! I have committed this to trunk and branch-3.3. It did not
apply cleanly to branch-3.2 and earlier, so please put up a patch for those
if desired.
[jira] [Commented] (YARN-10430) Log improvements in NodeStatusUpdaterImpl
[ https://issues.apache.org/jira/browse/YARN-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195727#comment-17195727 ]

Jim Brennan commented on YARN-10430:
------------------------------------

Thanks for the patch [~BilwaST]! +1, this looks good to me.
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195694#comment-17195694 ]

Jim Brennan commented on YARN-10393:
------------------------------------

[~wzzdreamer], [~adam.antal], what do you think of the draft patches? It
would be good to come to a resolution on this.

> MR job live lock caused by completed state container leak in heartbeat
> between node manager and RM
> ----------------------------------------------------------------------
>
>                 Key: YARN-10393
>                 URL: https://issues.apache.org/jira/browse/YARN-10393
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 3.4.0
>            Reporter: zhenzhao wang
>            Assignee: zhenzhao wang
>            Priority: Major
>         Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2.
> We hadn't seen it after 2.9 in our environment; however, that was because of
> the RPC retry-policy change and other changes. There is still a possibility
> even with the current code, if I didn't miss anything.
>
> *High-level description:*
> We had seen a starving-mapper issue several times. The MR job got stuck in a
> live-lock state and couldn't make any progress. The queue was full, so the
> pending mapper couldn't get any resource to continue, and the application
> master failed to preempt the reducer, thus causing the job to be stuck. The
> reason the application master didn't preempt the reducer was that there was
> a leaked container among the assigned mappers: the node manager failed to
> report the completed container to the resource manager.
>
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned
> container container_1501226097332_249991_01_000199 to
> attempt_1501226097332_249991_m_95_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1501226097332_249991_01_000199 transitioned from RUNNING
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08
> 16:07:04,238. In fact, the heartbeat request was actually handled by the
> resource manager; however, the node manager failed to receive the response.
> Let's assume heartBeatResponseId=$hid in the node manager. According to our
> configuration at the time, the next heartbeat would be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException:
> Connection reset by peer; Host Details : local host is: ; destination host
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at
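The leak pattern the description walks through — completed-container statuses dropped when a heartbeat response is lost — is commonly avoided by keeping each entry pending until a response actually acknowledges it. The following is a minimal illustrative sketch of that idea only; `HeartbeatTracker` and its methods are invented names, not Hadoop's NodeStatusUpdaterImpl API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: completed containers stay in pendingCompleted until a heartbeat
// *response* confirms receipt, so a lost response just means they are
// re-sent on the next heartbeat instead of leaking.
public class HeartbeatTracker {
    private final Set<String> pendingCompleted = new LinkedHashSet<>();

    public void containerCompleted(String containerId) {
        pendingCompleted.add(containerId);
    }

    // Build the next heartbeat payload without removing anything yet.
    public List<String> buildHeartbeat() {
        return new ArrayList<>(pendingCompleted);
    }

    // Only drop entries the RM's response explicitly acknowledges.
    public void onResponse(List<String> ackedContainers) {
        ackedContainers.forEach(pendingCompleted::remove);
    }
}
```

If the response for a heartbeat never arrives (as in step 3 above), the container id simply rides along on the next heartbeat; removing it eagerly at send time is what opens the window for the leak.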
[jira] [Commented] (YARN-9604) RM Shutdown with FATAL Exception
[ https://issues.apache.org/jira/browse/YARN-9604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195240#comment-17195240 ]

chan commented on YARN-9604:
----------------------------

I think it may be caused by the AM releasing the container, and the AM's
container-release path not taking the lock.

> RM Shutdown with FATAL Exception
> --------------------------------
>
>                 Key: YARN-9604
>                 URL: https://issues.apache.org/jira/browse/YARN-9604
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.9.0
>            Reporter: Amithsha
>            Priority: Critical
>
> Earlier we faced this FATAL exception, and it was resolved by adding the
> following properties:
>
> <property>
>   <name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
>   <value>1</value>
> </property>
>
> <property>
>   <name>yarn.scheduler.capacity.node-locality-delay</name>
>   <value>0</value>
> </property>
>
> https://issues.apache.org/jira/browse/YARN-8462 (patch and description)
>
> Recently we are facing the same FATAL exception with a different stack trace:
>
> 2019-06-06 08:30:38,424 FATAL event.EventDispatcher (?:?(?)) - Error in
> handling event type NODE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:814)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:876)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.allocateFromReservedContainer(LeafQueue.java:1002)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1026)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1274)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1430)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1205)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1067)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1472)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:151)
> at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:745)
> 2019-06-06 08:30:38,424 INFO event.EventDispatcher (?:?(?)) - Exiting, bbye..
[jira] [Updated] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container
[ https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan updated YARN-10395:
------------------------
       Fix Version/s:     (was: 2.9.2)
    Target Version/s:     (was: 2.9.2)
   Affects Version/s:     (was: 2.9.2)

> ReservedContainer Node is added to blackList of application due to this node
> can not allocate other container
> ----------------------------------------------------------------------------
>
>                 Key: YARN-10395
>                 URL: https://issues.apache.org/jira/browse/YARN-10395
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: chan
>            Priority: Major
>         Attachments: Yarn-10395-001.patch
>
>
> Currently, if an app has reserved a node but the node is added to the app's
> blacklist, then when this node sends a heartbeat to the resource manager the
> reserved-container allocation fails. This prevents the node from allocating
> any other container even though it has enough memory and vcores. So I think
> we can release the reserved container when the reserved node is in the
> blacklist of the app.
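The proposed behavior — release a reserved container when the reserving app has blacklisted the node — boils down to a simple check at reservation-allocation time. The sketch below uses invented names (`ReservationCheck`, `shouldReleaseReservation`), not the CapacityScheduler API; in the real scheduler the blacklist lookup and the release would happen inside the allocation path.

```java
import java.util.Set;

// Illustrative decision helper: if the node holding the reservation is in
// the application's blacklist, the reservation should be released so the
// node's memory and vcores can serve other containers.
public class ReservationCheck {
    public static boolean shouldReleaseReservation(String reservedNode,
                                                   Set<String> appBlacklist) {
        return appBlacklist.contains(reservedNode);
    }
}
```

The point of the check is ordering: it must run before the scheduler retries the reserved allocation on heartbeat, otherwise the node stays pinned to a reservation that can never succeed.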
[jira] [Commented] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource
[ https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195217#comment-17195217 ]

Xiaoqiao He commented on YARN-4575:
-----------------------------------

Thanks [~bibinchundatt], [~epayne], [~ebadger] for your work. It seems the
assignee was never marked for this issue. I have just assigned it to
[~bibinchundatt], the original author, in preparation for the 3.2.2 release.
Please let me know if I am missing something. Thanks again.

> ApplicationResourceUsageReport should return ALL reserved resource
> ------------------------------------------------------------------
>
>                 Key: YARN-4575
>                 URL: https://issues.apache.org/jira/browse/YARN-4575
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin Chundatt
>            Assignee: Bibin Chundatt
>            Priority: Major
>              Labels: oct16-easy
>             Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
>         Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch,
>                      YARN-4575.003.patch, YARN-4575.004.patch,
>                      YARN-4575.005.patch, YARN-4575.branch-3.1..005.patch
>
>
> The reserved-resource report in ApplicationResourceUsageReport covers only
> the default partition; it should cover all partitions.
[jira] [Assigned] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource
[ https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoqiao He reassigned YARN-4575:
---------------------------------

    Assignee: Bibin Chundatt
[jira] [Assigned] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoqiao He reassigned YARN-10332:
----------------------------------

    Assignee: yehuanhuan

Added [~yehuanhuan] to the contributors list and assigned this issue to him
in preparation for the 3.2.2 release.

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> ------------------------------------------------------------------------
>
>                 Key: YARN-10332
>                 URL: https://issues.apache.org/jira/browse/YARN-10332
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 3.2.1
>            Reporter: yehuanhuan
>            Assignee: yehuanhuan
>            Priority: Minor
>             Fix For: 3.2.2, 3.4.0, 3.3.1
>
>         Attachments: YARN-10332.001.patch
>
>
> The RESOURCE_UPDATE event was repeatedly registered in the DECOMMISSIONING
> state.