[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180959#comment-17180959 ] Yuanbo Liu commented on YARN-10393: --- [~Jim_Brennan] Thanks for the comments. > I think the concern is that if we remove that >pendingCompletedContainers.clear() Removing "pendingCompletedContainers.clear()" would indeed be a potential memory leak. I'd suggest that removing the "!isContainerRecentlyStopped(containerId)" check in NodeStatusUpdaterImpl.java [line: 613] would be a good way to fix this issue. {code:java} if (!isContainerRecentlyStopped(containerId)) { pendingCompletedContainers.put(containerId, containerStatus); }{code} Completed containers are cached for 10 minutes (the default value) until they time out or a heartbeat response acknowledges them, and a 10-minute cache is long enough to keep retrying the request through heartbeats (the default interval is 10s). > MR job live lock caused by completed state container leak in heartbeat > between node manager and RM > -- > > Key: YARN-10393 > URL: https://issues.apache.org/jira/browse/YARN-10393 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, > 3.4.0 >Reporter: zhenzhao wang >Assignee: zhenzhao wang >Priority: Major > > This was a bug we had seen multiple times on Hadoop 2.6.2. And the following > analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. > We hadn't seen it after 2.9 in our env. However, it was because of the RPC > retry policy change and other changes. There's still a possibility even with > the current code if I didn't miss anything. > *High-level description:* > We had seen a starving mapper issue several times. The MR job stuck in a > live lock state and couldn't make any progress. The queue is full so the > pending mapper can’t get any resource to continue, and the application master > failed to preempt the reducer, thus causing the job to be stuck. The reason > why the application master didn’t preempt the reducer was that there was a > leaked container in assigned mappers. The node manager failed to report the > completed container to the resource manager. > *Detailed steps:* > > # Container_1501226097332_249991_01_000199 was assigned to > attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417. > {code:java} > appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned > container container_1501226097332_249991_01_000199 to > attempt_1501226097332_249991_m_95_0 > {code} > # The container finished on 2017-08-08 16:02:53,313. > {code:java} > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1501226097332_249991_01_000199 transitioned from RUNNING > to EXITED_WITH_SUCCESS > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1501226097332_249991_01_000199 > {code} > # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 > 16:07:04,238. In fact, the heartbeat request is actually handled by resource > manager, however, the node manager failed to receive the response. Let’s > assume the heartBeatResponseId=$hid in node manager. According to our current > configuration, next heartbeat will be 10s later. 
> {code:java} > 2017-08-08 16:07:04,238 ERROR > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught > exception in status-updater > java.io.IOException: Failed on local exception: java.io.IOException: > Connection reset by peer; Host Details : local host is: ; destination host > is: XXX > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1472) > at org.apache.hadoop.ipc.Client.call(Client.java:1399) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at >
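For illustration, a minimal, self-contained sketch of the behaviour suggested above: keep every completed container status pending until the RM acknowledges it, rather than skipping statuses that are already in the recently-stopped cache. Class and member names below are placeholders, not the actual NodeStatusUpdaterImpl code.
{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Placeholder sketch; names are illustrative, not the Hadoop source.
public class PendingCompletedTrackerSketch {
  private final Map<String, String> pendingCompletedContainers = new ConcurrentHashMap<>();

  // Suggested behaviour: always keep a completed status pending, i.e. without the
  // "!isContainerRecentlyStopped(containerId)" guard quoted above, so the status is
  // re-sent on every heartbeat until it is acknowledged.
  void onCompletedContainer(String containerId, String containerStatus) {
    pendingCompletedContainers.put(containerId, containerStatus);
  }

  // When a heartbeat response acknowledges containers, only those entries are dropped;
  // the 10-minute recently-stopped cache (vs. a 10s heartbeat interval) still bounds
  // how long an entry can live, so the map should not grow without bound.
  void onHeartbeatAck(List<String> ackedContainerIds) {
    ackedContainerIds.forEach(pendingCompletedContainers::remove);
  }
}
{code}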
[jira] [Commented] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints
[ https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180852#comment-17180852 ] Hadoop QA commented on YARN-10397: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 39s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 24s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 42s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 40s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 49s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 42s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 23s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 43s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green}104m 11s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | |
[jira] [Commented] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints
[ https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180784#comment-17180784 ] Íñigo Goiri commented on YARN-10397: +1 on [^YARN-10397.002.patch]. > SchedulerRequest should be forwarded to scheduler if custom scheduler > supports placement constraints > > > Key: YARN-10397 > URL: https://issues.apache.org/jira/browse/YARN-10397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-10397.001.patch, YARN-10397.002.patch > > > Currently only CapacityScheduler supports placement constraints so request > gets forwarded only for capacityScheduler. Below exception will be thrown if > custom scheduler supports placement constraint > {code:java} > if (request.getSchedulingRequests() != null > && !request.getSchedulingRequests().isEmpty()) { > if (!(scheduler instanceof CapacityScheduler)) { > String message = "Found non empty SchedulingRequest of " > + "AllocateRequest for application=" + appAttemptId.toString() > + ", however the configured scheduler=" > + scheduler.getClass().getCanonicalName() > + " cannot handle placement constraints, rejecting this " > + "allocate operation"; > LOG.warn(message); > throw new YarnException(message); > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints
[ https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180774#comment-17180774 ] Bilwa S T commented on YARN-10397: -- Thanks [~elgoiri] for reviewing this. I have updated the javadoc. Yes, this is covered by the UT TestCapacitySchedulerSchedulingRequestUpdate; this test case checks whether the CapacityScheduler supports placement constraints. bq. BTW, I'm guessing you are using your own scheduler that supports this? Yes, we have our own scheduler which supports placement constraints. > SchedulerRequest should be forwarded to scheduler if custom scheduler > supports placement constraints > > > Key: YARN-10397 > URL: https://issues.apache.org/jira/browse/YARN-10397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-10397.001.patch, YARN-10397.002.patch > > > Currently only CapacityScheduler supports placement constraints so request > gets forwarded only for capacityScheduler. Below exception will be thrown if > custom scheduler supports placement constraint > {code:java} > if (request.getSchedulingRequests() != null > && !request.getSchedulingRequests().isEmpty()) { > if (!(scheduler instanceof CapacityScheduler)) { > String message = "Found non empty SchedulingRequest of " > + "AllocateRequest for application=" + appAttemptId.toString() > + ", however the configured scheduler=" > + scheduler.getClass().getCanonicalName() > + " cannot handle placement constraints, rejecting this " > + "allocate operation"; > LOG.warn(message); > throw new YarnException(message); > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints
[ https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-10397: - Attachment: YARN-10397.002.patch > SchedulerRequest should be forwarded to scheduler if custom scheduler > supports placement constraints > > > Key: YARN-10397 > URL: https://issues.apache.org/jira/browse/YARN-10397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-10397.001.patch, YARN-10397.002.patch > > > Currently only CapacityScheduler supports placement constraints so request > gets forwarded only for capacityScheduler. Below exception will be thrown if > custom scheduler supports placement constraint > {code:java} > if (request.getSchedulingRequests() != null > && !request.getSchedulingRequests().isEmpty()) { > if (!(scheduler instanceof CapacityScheduler)) { > String message = "Found non empty SchedulingRequest of " > + "AllocateRequest for application=" + appAttemptId.toString() > + ", however the configured scheduler=" > + scheduler.getClass().getCanonicalName() > + " cannot handle placement constraints, rejecting this " > + "allocate operation"; > LOG.warn(message); > throw new YarnException(message); > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints
[ https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180735#comment-17180735 ] Íñigo Goiri commented on YARN-10397: It would be nice to add a javadoc placementConstraintEnabled() in both the abstract scheduler and the capacity scheduler specifying this. BTW, is this covered by any unit test? BTW, I'm guessing you are using your own scheduler that supports this? > SchedulerRequest should be forwarded to scheduler if custom scheduler > supports placement constraints > > > Key: YARN-10397 > URL: https://issues.apache.org/jira/browse/YARN-10397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-10397.001.patch > > > Currently only CapacityScheduler supports placement constraints so request > gets forwarded only for capacityScheduler. Below exception will be thrown if > custom scheduler supports placement constraint > {code:java} > if (request.getSchedulingRequests() != null > && !request.getSchedulingRequests().isEmpty()) { > if (!(scheduler instanceof CapacityScheduler)) { > String message = "Found non empty SchedulingRequest of " > + "AllocateRequest for application=" + appAttemptId.toString() > + ", however the configured scheduler=" > + scheduler.getClass().getCanonicalName() > + " cannot handle placement constraints, rejecting this " > + "allocate operation"; > LOG.warn(message); > throw new YarnException(message); > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
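For illustration, one possible shape of the fix being discussed: replace the instanceof CapacityScheduler test quoted in the description with a capability query such as the placementConstraintEnabled() method mentioned above. This is only a sketch against the quoted snippet; the method name and its placement on the scheduler interface are assumptions, not the actual patch.
{code:java}
// Sketch only -- same structure as the snippet quoted in the description, with the
// instanceof check swapped for a hypothetical capability query on the scheduler.
if (request.getSchedulingRequests() != null
    && !request.getSchedulingRequests().isEmpty()) {
  if (!scheduler.placementConstraintEnabled()) {   // instead of: !(scheduler instanceof CapacityScheduler)
    String message = "Found non empty SchedulingRequest of "
        + "AllocateRequest for application=" + appAttemptId.toString()
        + ", however the configured scheduler="
        + scheduler.getClass().getCanonicalName()
        + " cannot handle placement constraints, rejecting this "
        + "allocate operation";
    LOG.warn(message);
    throw new YarnException(message);
  }
}
{code}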
[jira] [Updated] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints
[ https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-10397: - Priority: Minor (was: Major) > SchedulerRequest should be forwarded to scheduler if custom scheduler > supports placement constraints > > > Key: YARN-10397 > URL: https://issues.apache.org/jira/browse/YARN-10397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-10397.001.patch > > > Currently only CapacityScheduler supports placement constraints so request > gets forwarded only for capacityScheduler. Below exception will be thrown if > custom scheduler supports placement constraint > {code:java} > if (request.getSchedulingRequests() != null > && !request.getSchedulingRequests().isEmpty()) { > if (!(scheduler instanceof CapacityScheduler)) { > String message = "Found non empty SchedulingRequest of " > + "AllocateRequest for application=" + appAttemptId.toString() > + ", however the configured scheduler=" > + scheduler.getClass().getCanonicalName() > + " cannot handle placement constraints, rejecting this " > + "allocate operation"; > LOG.warn(message); > throw new YarnException(message); > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10399) Refactor NodeQueueLoadMonitor class to make it extendable
[ https://issues.apache.org/jira/browse/YARN-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180704#comment-17180704 ] Íñigo Goiri commented on YARN-10399: Merged the PR. Thanks [~zhengbl] for the PR and [~BilwaST] for the review. > Refactor NodeQueueLoadMonitor class to make it extendable > - > > Key: YARN-10399 > URL: https://issues.apache.org/jira/browse/YARN-10399 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Zhengbo Li >Assignee: Zhengbo Li >Priority: Minor > Fix For: 3.4.0 > > > Currently the NodeQueueLoadMonitor is written in a way that doesn't allow > overriding the node selection logic; this refactor will extract the core > logic piece to allow extending classes to override the default logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
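For illustration, the general shape of such a refactor: keep the default selection logic but move it into an overridable hook so a subclass can replace it without copying the whole monitor. Names below are placeholders, not the merged change.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Placeholder sketch of the template-method style refactor; not the actual PR.
class NodeQueueLoadMonitorSketch {
  // Default node-selection logic, exposed as a protected hook.
  protected List<String> selectNodes(List<String> clusterNodes) {
    return clusterNodes;
  }
}

class CustomLoadMonitor extends NodeQueueLoadMonitorSketch {
  @Override
  protected List<String> selectNodes(List<String> clusterNodes) {
    // A subclass can reorder or filter nodes without duplicating the monitor.
    List<String> copy = new ArrayList<>(clusterNodes);
    Collections.reverse(copy);
    return copy;
  }
}
{code}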
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180612#comment-17180612 ] Jim Brennan commented on YARN-10393: Thanks [~wzzdreamer] for submitting this with such a detailed analysis. While I agree with your analysis, I am not sure I agree with the solution. It seems to me that the change you made to NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext() is all that is required to ensure that the completed container status is not lost. I don't think you need to change the RM/NM protocol to manually resend the last NodeHeartbeatRequest again. As you noted, the RPC retry logic is already doing that. Also note that there is a lot of other state in that request, so I am not sure of the implications of not sending the most recent status for all that other state. Changing the protocol seems scary. But the change you made in removeOrTrackCompletedContainersFromContext() seems to go directly to the problem. The current code is always clearing pendingCompletedContainers at the end of that function. I've read through [YARN-2997] and it seems like this was a late addition to the patch, but it is not clear to me why it was added. I think the concern is that if we remove that pendingCompletedContainers.clear(), there is a potential to leak entries in pendingCompletedContainers. The question is whether it is possible to miss an ack in the heartbeat response for a completed container. > MR job live lock caused by completed state container leak in heartbeat > between node manager and RM > -- > > Key: YARN-10393 > URL: https://issues.apache.org/jira/browse/YARN-10393 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, > 3.4.0 >Reporter: zhenzhao wang >Assignee: zhenzhao wang >Priority: Major > > This was a bug we had seen multiple times on Hadoop 2.6.2. And the following > analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. > We hadn't seen it after 2.9 in our env. However, it was because of the RPC > retry policy change and other changes. There's still a possibility even with > the current code if I didn't miss anything. > *High-level description:* > We had seen a starving mapper issue several times. The MR job stuck in a > live lock state and couldn't make any progress. The queue is full so the > pending mapper can’t get any resource to continue, and the application master > failed to preempt the reducer, thus causing the job to be stuck. The reason > why the application master didn’t preempt the reducer was that there was a > leaked container in assigned mappers. The node manager failed to report the > completed container to the resource manager. > *Detailed steps:* > > # Container_1501226097332_249991_01_000199 was assigned to > attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417. > {code:java} > appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned > container container_1501226097332_249991_01_000199 to > attempt_1501226097332_249991_m_95_0 > {code} > # The container finished on 2017-08-08 16:02:53,313. 
> {code:java} > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1501226097332_249991_01_000199 transitioned from RUNNING > to EXITED_WITH_SUCCESS > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1501226097332_249991_01_000199 > {code} > # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 > 16:07:04,238. In fact, the heartbeat request is actually handled by resource > manager, however, the node manager failed to receive the response. Let’s > assume the heartBeatResponseId=$hid in node manager. According to our current > configuration, next heartbeat will be 10s later. > {code:java} > 2017-08-08 16:07:04,238 ERROR > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught > exception in status-updater > java.io.IOException: Failed on local exception: java.io.IOException: > Connection reset by peer; Host Details : local host is: ; destination host > is: XXX > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1472) > at org.apache.hadoop.ipc.Client.call(Client.java:1399) > at >
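For illustration, a toy model of the trade-off raised above, with placeholder names rather than the real removeOrTrackCompletedContainersFromContext() code: clearing the pending map after every heartbeat can drop a status when the response is lost, while removing only acknowledged entries avoids that loss but needs an ack or timeout path for every entry to avoid leaking.
{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only; names are placeholders, not NodeStatusUpdaterImpl members.
public class CompletedContainerAckSketch {
  private final Map<String, String> pendingCompletedContainers = new ConcurrentHashMap<>();

  // Current behaviour as described: wipe everything after each heartbeat.
  // If the response carrying the ack never arrives, the completed status is lost.
  void afterHeartbeatClearAll() {
    pendingCompletedContainers.clear();
  }

  // Alternative under discussion: drop only what the RM explicitly acknowledged.
  // Robust against a lost response, but entries can leak unless every completed
  // container is eventually acked or expired by a timeout.
  void afterHeartbeatRemoveAcked(List<String> ackedContainerIds) {
    ackedContainerIds.forEach(pendingCompletedContainers::remove);
  }
}
{code}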
[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query
[ https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180578#comment-17180578 ] Hadoop QA commented on YARN-10304: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 8s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 1s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 41s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 29s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 11s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 33s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 52s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 1s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 54s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 6s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 6s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 20m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 29s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 17m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 19s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 50s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 0s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 32s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || |
[jira] [Commented] (YARN-10368) Log aggregation reset to NOT_START after RM restart.
[ https://issues.apache.org/jira/browse/YARN-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180529#comment-17180529 ] Amithsha commented on YARN-10368: - [~singhania.anuj18] Can you explain a bit more regarding the configs? > Log aggregation reset to NOT_START after RM restart. > > > Key: YARN-10368 > URL: https://issues.apache.org/jira/browse/YARN-10368 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, yarn >Affects Versions: 3.2.1 >Reporter: Anuj >Priority: Major > Attachments: Screenshot 2020-07-27 at 2.35.15 PM.png > > > Attempt recovered after RM restart the log aggregation status is not > preserved and it come to NOT_START. > From NOT_START it never moves to TIMED_OUT and then never cleaned up RM App > in memory resulting max-completed-app in memory limit hit and RM stops > accepting new apps. > https://issues.apache.org/jira/browse/YARN-7952 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query
[ https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180460#comment-17180460 ] Andras Gyori commented on YARN-10304: - Uploaded a new patch rebased on recent trunk to retrigger jenkins job. > Create an endpoint for remote application log directory path query > -- > > Key: YARN-10304 > URL: https://issues.apache.org/jira/browse/YARN-10304 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10304.001.patch, YARN-10304.002.patch, > YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, > YARN-10304.006.patch, YARN-10304.007.patch, YARN-10304.008.patch, > YARN-10304.009.patch > > > The logic of the aggregated log directory path determination (currently based > on configuration) is scattered around the codebase and duplicated multiple > times. By providing a separate class for creating the path for a specific > user, it allows for an abstraction over this logic. This could be used in > place of the previously duplicated logic, moreover, we could provide an > endpoint to query this path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
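For illustration, a minimal sketch of the kind of abstraction described in the issue: one helper that derives the remote application log directory from configuration for a given user and application. The class name, method, and exact directory layout here are assumptions for illustration, not the contents of the attached patches.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, illustrative only; not the code in the attached patches.
public class RemoteAppLogDirSketch {
  private final Path remoteRootLogDir;
  private final String suffix;

  public RemoteAppLogDirSketch(Configuration conf) {
    this.remoteRootLogDir =
        new Path(conf.get("yarn.nodemanager.remote-app-log-dir", "/tmp/logs"));
    this.suffix = conf.get("yarn.nodemanager.remote-app-log-dir-suffix", "logs");
  }

  // Assumed layout {root}/{user}/{suffix}/{appId}; the real layout may differ.
  public Path getRemoteAppLogDir(String user, String appId) {
    return new Path(new Path(new Path(remoteRootLogDir, user), suffix), appId);
  }
}
{code}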
[jira] [Updated] (YARN-10304) Create an endpoint for remote application log directory path query
[ https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-10304: Attachment: YARN-10304.009.patch > Create an endpoint for remote application log directory path query > -- > > Key: YARN-10304 > URL: https://issues.apache.org/jira/browse/YARN-10304 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10304.001.patch, YARN-10304.002.patch, > YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, > YARN-10304.006.patch, YARN-10304.007.patch, YARN-10304.008.patch, > YARN-10304.009.patch > > > The logic of the aggregated log directory path determination (currently based > on configuration) is scattered around the codebase and duplicated multiple > times. By providing a separate class for creating the path for a specific > user, it allows for an abstraction over this logic. This could be used in > place of the previously duplicated logic, moreover, we could provide an > endpoint to query this path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10368) Log aggregation reset to NOT_START after RM restart.
[ https://issues.apache.org/jira/browse/YARN-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180448#comment-17180448 ] Anuj commented on YARN-10368: - For now, I have removed the log aggregation check during cleanup. > Log aggregation reset to NOT_START after RM restart. > > > Key: YARN-10368 > URL: https://issues.apache.org/jira/browse/YARN-10368 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, yarn >Affects Versions: 3.2.1 >Reporter: Anuj >Priority: Major > Attachments: Screenshot 2020-07-27 at 2.35.15 PM.png > > > Attempt recovered after RM restart the log aggregation status is not > preserved and it come to NOT_START. > From NOT_START it never moves to TIMED_OUT and then never cleaned up RM App > in memory resulting max-completed-app in memory limit hit and RM stops > accepting new apps. > https://issues.apache.org/jira/browse/YARN-7952 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10368) Log aggregation reset to NOT_START after RM restart.
[ https://issues.apache.org/jira/browse/YARN-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180435#comment-17180435 ] Amithsha commented on YARN-10368: - [~singhania.anuj18] The issue is because of https://issues.apache.org/jira/browse/YARN-4946; to solve it, revert it as tracked in https://issues.apache.org/jira/browse/YARN-10244 > Log aggregation reset to NOT_START after RM restart. > > > Key: YARN-10368 > URL: https://issues.apache.org/jira/browse/YARN-10368 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager, yarn >Affects Versions: 3.2.1 >Reporter: Anuj >Priority: Major > Attachments: Screenshot 2020-07-27 at 2.35.15 PM.png > > > Attempt recovered after RM restart the log aggregation status is not > preserved and it come to NOT_START. > From NOT_START it never moves to TIMED_OUT and then never cleaned up RM App > in memory resulting max-completed-app in memory limit hit and RM stops > accepting new apps. > https://issues.apache.org/jira/browse/YARN-7952 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180376#comment-17180376 ] Amithsha commented on YARN-4946: Team, due to this patch we are hitting an issue where the app is not removed from RM memory: 2020-08-19 12:51:21,725 INFO resourcemanager.RMAppManager (?:?(?)) - Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 300, but not removing app application_1590738905913_0019 from state store as log aggregation have not finished yet. > RM should not consider an application as COMPLETED when log aggregation is > not in a terminal state > -- > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-4946.001.patch, YARN-4946.002.patch, > YARN-4946.003.patch, YARN-4946.004.patch > > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > The RM should not consider an app to be fully completed (and thus removed > from its history) until the aggregation status has reached a terminal state > (e.g. SUCCEEDED, FAILED, TIME_OUT). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
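For illustration, a toy version of the retention guard this behaviour comes from: a completed app is only dropped from the state store once its log aggregation status is terminal, so an app whose status is stuck at NOT_START (the symptom reported in YARN-10368) is never removed. Names are placeholders, not the RMAppManager code.
{code:java}
// Placeholder sketch, not the actual RMAppManager logic.
public class CompletedAppRetentionSketch {
  enum LogAggregationStatus { NOT_START, RUNNING, SUCCEEDED, FAILED, TIME_OUT }

  // Only terminal aggregation states make a completed app removable; a status
  // stuck at NOT_START keeps the app in the state store (and RM memory) forever.
  static boolean removableFromStateStore(LogAggregationStatus status) {
    return status == LogAggregationStatus.SUCCEEDED
        || status == LogAggregationStatus.FAILED
        || status == LogAggregationStatus.TIME_OUT;
  }
}
{code}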
[jira] [Updated] (YARN-10401) AggregateContainersPreempted in QueueMetrics is not correct when set yarn.scheduler.capacity.lazy-preemption-enabled as true
[ https://issues.apache.org/jira/browse/YARN-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juanjuan Tian updated YARN-10401: -- Attachment: YARN-10401-001.patch > AggregateContainersPreempted in QueueMetrics is not correct when set > yarn.scheduler.capacity.lazy-preemption-enabled as true > > > Key: YARN-10401 > URL: https://issues.apache.org/jira/browse/YARN-10401 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Juanjuan Tian >Assignee: Juanjuan Tian >Priority: Major > Attachments: YARN-10401-001.patch > > > AggregateContainersPreempted in QueueMetrics is always zero when set > yarn.scheduler.capacity.lazy-preemption-enabled as true -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10396) Max applications calculation per queue disregards queue level settings in absolute mode
[ https://issues.apache.org/jira/browse/YARN-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-10396: --- Fix Version/s: 3.1.5 3.3.1 3.2.2 > Max applications calculation per queue disregards queue level settings in > absolute mode > --- > > Key: YARN-10396 > URL: https://issues.apache.org/jira/browse/YARN-10396 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5 > > Attachments: YARN-10396.001.patch, YARN-10396.002.patch, > YARN-10396.003.patch > > > Looking at the following code in > {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.java#L1126}} > {code:java} > int maxApplications = (int) (conf.getMaximumSystemApplications() > * childQueue.getQueueCapacities().getAbsoluteCapacity(label)); > leafQueue.setMaxApplications(maxApplications);{code} > In Absolute Resources mode setting the number of maximum applications on > queue level gets overridden with the system level setting scaled down to the > available resources. This means that the only way to set the maximum number > of applications is to change the queue's resource pool. This line should > consider the queue's > {{yarn.scheduler.capacity.\{queuepath}.maximum-applications }}setting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
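For illustration, one way the line quoted in the description could take the per-queue setting into account, assuming a lookup such as getMaximumApplicationsPerQueue() on the scheduler configuration; this is a sketch of the idea, not the committed patch.
{code:java}
// Sketch only -- the helper name and its "unset" convention are assumptions.
int maxApplications = conf.getMaximumApplicationsPerQueue(childQueue.getQueuePath());
if (maxApplications < 0) {
  // yarn.scheduler.capacity.<queuepath>.maximum-applications not set on the queue:
  // fall back to scaling the system-wide maximum by the queue's absolute capacity.
  maxApplications = (int) (conf.getMaximumSystemApplications()
      * childQueue.getQueueCapacities().getAbsoluteCapacity(label));
}
leafQueue.setMaxApplications(maxApplications);
{code}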