[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-19 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180959#comment-17180959
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~Jim_Brennan]

Thanks for the comments.

> I think the concern is that if we remove that 
> pendingCompletedContainers.clear()

Removing "pendingCompletedContainers.clear()" would indeed be a potential memory 
leak. Instead, I'd suggest removing "!isContainerRecentlyStopped(containerId)" in 
NodeStatusUpdaterImpl.java [line 613] to fix this issue:
{code:java}
if (!isContainerRecentlyStopped(containerId)) {
  pendingCompletedContainers.put(containerId, containerStatus);
}
{code}
Completed containers will be cached for 10 minutes (the default value) until they 
time out or a heartbeat response is received, and a 10-minute cache for completed 
containers is long enough to retry sending the request through the heartbeat 
(the default interval is 10s).
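
For illustration only, a minimal sketch of that suggestion (not an actual patch; 
the surrounding method in NodeStatusUpdaterImpl is elided):
{code:java}
// Sketch of the suggested change (illustrative only).
// Before: the completed status is skipped once the container is tracked as
// recently stopped, so a status lost in a failed heartbeat is never resent.
//   if (!isContainerRecentlyStopped(containerId)) {
//     pendingCompletedContainers.put(containerId, containerStatus);
//   }
// After: always queue the completed status; it stays queued until the RM acks
// it in a heartbeat response or the recently-stopped cache (10 minutes by
// default) expires.
pendingCompletedContainers.put(containerId, containerStatus);
{code}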

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. The following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our environment, but that was because of the RPC 
> retry policy change and other changes. There is still a possibility even with 
> the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason the 
> application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> 

[jira] [Commented] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints

2020-08-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180852#comment-17180852
 ] 

Hadoop QA commented on YARN-10397:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
39s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
51s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 24s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
35s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
42s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
40s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 23s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
43s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}104m 
11s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| 

[jira] [Commented] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints

2020-08-19 Thread Jira


[ 
https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180784#comment-17180784
 ] 

Íñigo Goiri commented on YARN-10397:


+1 on  [^YARN-10397.002.patch].

> SchedulerRequest should be forwarded to scheduler if custom scheduler 
> supports placement constraints
> 
>
> Key: YARN-10397
> URL: https://issues.apache.org/jira/browse/YARN-10397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10397.001.patch, YARN-10397.002.patch
>
>
> Currently only CapacityScheduler supports placement constraints, so the request 
> gets forwarded only for CapacityScheduler. The exception below will be thrown even 
> if a custom scheduler supports placement constraints:
> {code:java}
> if (request.getSchedulingRequests() != null
> && !request.getSchedulingRequests().isEmpty()) {
>   if (!(scheduler instanceof CapacityScheduler)) {
> String message = "Found non empty SchedulingRequest of "
> + "AllocateRequest for application=" + appAttemptId.toString()
> + ", however the configured scheduler="
> + scheduler.getClass().getCanonicalName()
> + " cannot handle placement constraints, rejecting this "
> + "allocate operation";
> LOG.warn(message);
> throw new YarnException(message);
>   }
> }
> {code}
>  






[jira] [Commented] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints

2020-08-19 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180774#comment-17180774
 ] 

Bilwa S T commented on YARN-10397:
--

Thanks [~elgoiri] for reviewing this. I have updated the javadoc.

Yes, this is covered by the UT TestCapacitySchedulerSchedulingRequestUpdate. This 
test case checks whether CapacityScheduler supports placement constraints or not.

bq. BTW, I'm guessing you are using your own scheduler that supports this?

Yes, we have our own scheduler that supports placement constraints.

> SchedulerRequest should be forwarded to scheduler if custom scheduler 
> supports placement constraints
> 
>
> Key: YARN-10397
> URL: https://issues.apache.org/jira/browse/YARN-10397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10397.001.patch, YARN-10397.002.patch
>
>
> Currently only CapacityScheduler supports placement constraints, so the request 
> gets forwarded only for CapacityScheduler. The exception below will be thrown even 
> if a custom scheduler supports placement constraints:
> {code:java}
> if (request.getSchedulingRequests() != null
> && !request.getSchedulingRequests().isEmpty()) {
>   if (!(scheduler instanceof CapacityScheduler)) {
> String message = "Found non empty SchedulingRequest of "
> + "AllocateRequest for application=" + appAttemptId.toString()
> + ", however the configured scheduler="
> + scheduler.getClass().getCanonicalName()
> + " cannot handle placement constraints, rejecting this "
> + "allocate operation";
> LOG.warn(message);
> throw new YarnException(message);
>   }
> }
> {code}
>  






[jira] [Updated] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints

2020-08-19 Thread Bilwa S T (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bilwa S T updated YARN-10397:
-
Attachment: YARN-10397.002.patch

> SchedulerRequest should be forwarded to scheduler if custom scheduler 
> supports placement constraints
> 
>
> Key: YARN-10397
> URL: https://issues.apache.org/jira/browse/YARN-10397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10397.001.patch, YARN-10397.002.patch
>
>
> Currently only CapacityScheduler supports placement constraints, so the request 
> gets forwarded only for CapacityScheduler. The exception below will be thrown even 
> if a custom scheduler supports placement constraints:
> {code:java}
> if (request.getSchedulingRequests() != null
> && !request.getSchedulingRequests().isEmpty()) {
>   if (!(scheduler instanceof CapacityScheduler)) {
> String message = "Found non empty SchedulingRequest of "
> + "AllocateRequest for application=" + appAttemptId.toString()
> + ", however the configured scheduler="
> + scheduler.getClass().getCanonicalName()
> + " cannot handle placement constraints, rejecting this "
> + "allocate operation";
> LOG.warn(message);
> throw new YarnException(message);
>   }
> }
> {code}
>  






[jira] [Commented] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints

2020-08-19 Thread Jira


[ 
https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180735#comment-17180735
 ] 

Íñigo Goiri commented on YARN-10397:


It would be nice to add a javadoc to placementConstraintEnabled() in both the 
abstract scheduler and the capacity scheduler specifying this.
BTW, is this covered by any unit test?

BTW, I'm guessing you are using your own scheduler that supports this?
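
For illustration, a hedged sketch of that capability-based check (the method name 
follows the comment above; the placement and wiring are assumptions, not the 
actual patch):
{code:java}
// Illustrative sketch only (not the actual YARN-10397 patch).
// AbstractYarnScheduler: default to "no placement-constraint support".
public boolean placementConstraintEnabled() {
  return false;
}

// CapacityScheduler (and any custom scheduler that supports constraints)
// overrides it:
@Override
public boolean placementConstraintEnabled() {
  return true;
}

// Allocate path: replace the instanceof CapacityScheduler check with the
// capability check, so a custom scheduler that supports constraints is not
// rejected.
if (request.getSchedulingRequests() != null
    && !request.getSchedulingRequests().isEmpty()
    && !scheduler.placementConstraintEnabled()) {
  String message = "Found non empty SchedulingRequest of AllocateRequest for"
      + " application=" + appAttemptId + ", however the configured scheduler="
      + scheduler.getClass().getCanonicalName()
      + " cannot handle placement constraints, rejecting this allocate operation";
  LOG.warn(message);
  throw new YarnException(message);
}
{code}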

> SchedulerRequest should be forwarded to scheduler if custom scheduler 
> supports placement constraints
> 
>
> Key: YARN-10397
> URL: https://issues.apache.org/jira/browse/YARN-10397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10397.001.patch
>
>
> Currently only CapacityScheduler supports placement constraints, so the request 
> gets forwarded only for CapacityScheduler. The exception below will be thrown even 
> if a custom scheduler supports placement constraints:
> {code:java}
> if (request.getSchedulingRequests() != null
> && !request.getSchedulingRequests().isEmpty()) {
>   if (!(scheduler instanceof CapacityScheduler)) {
> String message = "Found non empty SchedulingRequest of "
> + "AllocateRequest for application=" + appAttemptId.toString()
> + ", however the configured scheduler="
> + scheduler.getClass().getCanonicalName()
> + " cannot handle placement constraints, rejecting this "
> + "allocate operation";
> LOG.warn(message);
> throw new YarnException(message);
>   }
> }
> {code}
>  






[jira] [Updated] (YARN-10397) SchedulerRequest should be forwarded to scheduler if custom scheduler supports placement constraints

2020-08-19 Thread Bilwa S T (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bilwa S T updated YARN-10397:
-
Priority: Minor  (was: Major)

> SchedulerRequest should be forwarded to scheduler if custom scheduler 
> supports placement constraints
> 
>
> Key: YARN-10397
> URL: https://issues.apache.org/jira/browse/YARN-10397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10397.001.patch
>
>
> Currently only CapacityScheduler supports placement constraints, so the request 
> gets forwarded only for CapacityScheduler. The exception below will be thrown even 
> if a custom scheduler supports placement constraints:
> {code:java}
> if (request.getSchedulingRequests() != null
> && !request.getSchedulingRequests().isEmpty()) {
>   if (!(scheduler instanceof CapacityScheduler)) {
> String message = "Found non empty SchedulingRequest of "
> + "AllocateRequest for application=" + appAttemptId.toString()
> + ", however the configured scheduler="
> + scheduler.getClass().getCanonicalName()
> + " cannot handle placement constraints, rejecting this "
> + "allocate operation";
> LOG.warn(message);
> throw new YarnException(message);
>   }
> }
> {code}
>  






[jira] [Commented] (YARN-10399) Refactor NodeQueueLoadMonitor class to make it extendable

2020-08-19 Thread Jira


[ 
https://issues.apache.org/jira/browse/YARN-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180704#comment-17180704
 ] 

Íñigo Goiri commented on YARN-10399:


Merged the PR.
Thanks [~zhengbl] for the PR and [~BilwaST] for the review.

> Refactor NodeQueueLoadMonitor class to make it extendable
> -
>
> Key: YARN-10399
> URL: https://issues.apache.org/jira/browse/YARN-10399
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: Zhengbo Li
>Assignee: Zhengbo Li
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently the NodeQueueLoadMonitor is written in a way that doesn't allow 
> overriding the node selection logic; this refactor will extract the core 
> logic so that an extending class can override the default logic.
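
For illustration only (generic names, not the actual classes from the merged PR), 
the refactor pattern described above keeps the ranking flow in the base class and 
exposes the selection criterion as an overridable hook:
{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Generic sketch of the "extract an overridable hook" refactor (hypothetical names).
abstract class LoadMonitorBase<N> {

  // The core flow stays in the base class: sort by load and take the top N.
  public List<N> selectLeastLoaded(List<N> nodes, int count) {
    return nodes.stream()
        .sorted(loadComparator())
        .limit(count)
        .collect(Collectors.toList());
  }

  // Subclasses override only the selection criterion.
  protected abstract Comparator<N> loadComparator();
}
{code}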






[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180612#comment-17180612
 ] 

Jim Brennan commented on YARN-10393:


 Thanks [~wzzdreamer] for submitting this with such a detailed analysis.  While 
I agree with your analysis, I am not sure I agree with the solution.

It seems to me that the change you made to 
NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext() is all that 
is required to ensure that the completed container status is not lost.  I don't 
think you need to change the RM/NM protocol to manually resend the last 
NodeHeartbeatRequest again.  As you noted, the RPC retry logic is already doing 
that.  Also note that there is a lot of other state in that request, so I am 
not sure of the implications of not sending the most recent status for all that 
other state.  Changing the protocol seems scary.

But the change you made in removeOrTrackCompletedContainersFromContext() seems 
to go directly to the problem.  The current code is always clearing 
pendingCompletedContainers at the end of that function.  I've read through 
[YARN-2997] and it seems like this was a late addition to the patch, but it is 
not clear to me why it was added.

I think the concern is that if we remove that 
pendingCompletedContainers.clear(), there is a potential to leak entries in 
pendingCompletedContainers.  The question is whether it is possible to miss an 
ack in the heartbeat response for a completed container.


> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. The following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our environment, but that was because of the RPC 
> retry policy change and other changes. There is still a possibility even with 
> the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason the 
> application master didn't preempt the reducer was that there was a leaked 
> container in the assigned mappers: the node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> 

[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-08-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180578#comment-17180578
 ] 

Hadoop QA commented on YARN-10304:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m  
8s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
1s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 
41s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 
29s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
20m 33s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
52s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  
1s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 
54s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  
6s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
21s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m  
6s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 20m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 
29s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 17m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 19s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
50s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  
0s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| 

[jira] [Commented] (YARN-10368) Log aggregation reset to NOT_START after RM restart.

2020-08-19 Thread Amithsha (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180529#comment-17180529
 ] 

Amithsha commented on YARN-10368:
-

[~singhania.anuj18] can you explain a bit more regarding the configs?

> Log aggregation reset to NOT_START after RM restart.
> 
>
> Key: YARN-10368
> URL: https://issues.apache.org/jira/browse/YARN-10368
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, yarn
>Affects Versions: 3.2.1
>Reporter: Anuj
>Priority: Major
> Attachments: Screenshot 2020-07-27 at 2.35.15 PM.png
>
>
> When an attempt is recovered after an RM restart, the log aggregation status is 
> not preserved and it comes back as NOT_START.
> From NOT_START it never moves to TIMED_OUT, so the RM app is never cleaned up 
> from memory, resulting in the max-completed-apps-in-memory limit being hit and 
> the RM no longer accepting new apps.
> https://issues.apache.org/jira/browse/YARN-7952






[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-08-19 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180460#comment-17180460
 ] 

Andras Gyori commented on YARN-10304:
-

Uploaded a new patch rebased on recent trunk to retrigger the Jenkins job.

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch, YARN-10304.002.patch, 
> YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, 
> YARN-10304.006.patch, YARN-10304.007.patch, YARN-10304.008.patch, 
> YARN-10304.009.patch
>
>
> The logic for determining the aggregated log directory path (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. Providing a separate class for creating the path for a specific 
> user allows for an abstraction over this logic. It could be used in 
> place of the previously duplicated logic; moreover, we could provide an 
> endpoint to query this path.
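
For illustration, a hedged sketch of what such a path-building class could look 
like (the class and method names are hypothetical, not the actual patch; the 
layout shown is the usual {remote-root}/{user}/{suffix} convention):
{code:java}
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: a single place that knows how the per-user aggregated
// log directory is laid out, instead of duplicating the logic at call sites.
class RemoteAppLogDirPathBuilder {
  private final Path remoteRootLogDir; // e.g. yarn.nodemanager.remote-app-log-dir
  private final String suffix;         // e.g. yarn.nodemanager.remote-app-log-dir-suffix

  RemoteAppLogDirPathBuilder(Path remoteRootLogDir, String suffix) {
    this.remoteRootLogDir = remoteRootLogDir;
    this.suffix = suffix;
  }

  // {remote-root}/{user}/{suffix}
  Path getRemoteLogDirForUser(String user) {
    return new Path(new Path(remoteRootLogDir, user), suffix);
  }
}
{code}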






[jira] [Updated] (YARN-10304) Create an endpoint for remote application log directory path query

2020-08-19 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10304:

Attachment: YARN-10304.009.patch

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch, YARN-10304.002.patch, 
> YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, 
> YARN-10304.006.patch, YARN-10304.007.patch, YARN-10304.008.patch, 
> YARN-10304.009.patch
>
>
> The logic for determining the aggregated log directory path (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. Providing a separate class for creating the path for a specific 
> user allows for an abstraction over this logic. It could be used in 
> place of the previously duplicated logic; moreover, we could provide an 
> endpoint to query this path.






[jira] [Commented] (YARN-10368) Log aggregation reset to NOT_START after RM restart.

2020-08-19 Thread Anuj (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180448#comment-17180448
 ] 

Anuj commented on YARN-10368:
-

For now I have removed the check for log aggregation during cleanup.

> Log aggregation reset to NOT_START after RM restart.
> 
>
> Key: YARN-10368
> URL: https://issues.apache.org/jira/browse/YARN-10368
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, yarn
>Affects Versions: 3.2.1
>Reporter: Anuj
>Priority: Major
> Attachments: Screenshot 2020-07-27 at 2.35.15 PM.png
>
>
> When an attempt is recovered after an RM restart, the log aggregation status is 
> not preserved and it comes back as NOT_START.
> From NOT_START it never moves to TIMED_OUT, so the RM app is never cleaned up 
> from memory, resulting in the max-completed-apps-in-memory limit being hit and 
> the RM no longer accepting new apps.
> https://issues.apache.org/jira/browse/YARN-7952






[jira] [Commented] (YARN-10368) Log aggregation reset to NOT_START after RM restart.

2020-08-19 Thread Amithsha (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180435#comment-17180435
 ] 

Amithsha commented on YARN-10368:
-

[~singhania.anuj18] 

The issue is because of https://issues.apache.org/jira/browse/YARN-4946

and to solve it, revert via https://issues.apache.org/jira/browse/YARN-10244

> Log aggregation reset to NOT_START after RM restart.
> 
>
> Key: YARN-10368
> URL: https://issues.apache.org/jira/browse/YARN-10368
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager, yarn
>Affects Versions: 3.2.1
>Reporter: Anuj
>Priority: Major
> Attachments: Screenshot 2020-07-27 at 2.35.15 PM.png
>
>
> When an attempt is recovered after an RM restart, the log aggregation status is 
> not preserved and it comes back as NOT_START.
> From NOT_START it never moves to TIMED_OUT, so the RM app is never cleaned up 
> from memory, resulting in the max-completed-apps-in-memory limit being hit and 
> the RM no longer accepting new apps.
> https://issues.apache.org/jira/browse/YARN-7952






[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2020-08-19 Thread Amithsha (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180376#comment-17180376
 ] 

Amithsha commented on YARN-4946:


Team, due to this patch we are hitting an issue where the app is not removed 
from RM memory:

2020-08-19 12:51:21,725 INFO  resourcemanager.RMAppManager (?:?(?)) - Max 
number of completed apps kept in state store met: maxCompletedAppsInStateStore 
= 300, but not removing app application_1590738905913_0019 from state store as 
log aggregation have not finished yet.

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).
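
For illustration, a minimal sketch of the terminal-state check implied above (a 
hypothetical helper, not the committed change):
{code:java}
import org.apache.hadoop.yarn.api.records.LogAggregationStatus;

// Hypothetical helper, for illustration only: an app would only be eligible for
// removal from RM memory once log aggregation has reached a terminal state.
static boolean isLogAggregationTerminal(LogAggregationStatus status) {
  switch (status) {
    case SUCCEEDED:
    case FAILED:
    case TIME_OUT:
      return true;
    default:
      // NOT_START, RUNNING, RUNNING_WITH_FAILURE (and DISABLED) are treated
      // as non-terminal in this sketch.
      return false;
  }
}
{code}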






[jira] [Updated] (YARN-10401) AggregateContainersPreempted in QueueMetrics is not correct when set yarn.scheduler.capacity.lazy-preemption-enabled as true

2020-08-19 Thread Juanjuan Tian (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juanjuan Tian  updated YARN-10401:
--
Attachment: YARN-10401-001.patch

> AggregateContainersPreempted in QueueMetrics is not correct when set 
> yarn.scheduler.capacity.lazy-preemption-enabled as true
> 
>
> Key: YARN-10401
> URL: https://issues.apache.org/jira/browse/YARN-10401
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Juanjuan Tian 
>Assignee: Juanjuan Tian 
>Priority: Major
> Attachments: YARN-10401-001.patch
>
>
> AggregateContainersPreempted in QueueMetrics is always zero when 
> yarn.scheduler.capacity.lazy-preemption-enabled is set to true.






[jira] [Updated] (YARN-10396) Max applications calculation per queue disregards queue level settings in absolute mode

2020-08-19 Thread Sunil G (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-10396:
---
Fix Version/s: 3.1.5
   3.3.1
   3.2.2

> Max applications calculation per queue disregards queue level settings in 
> absolute mode
> ---
>
> Key: YARN-10396
> URL: https://issues.apache.org/jira/browse/YARN-10396
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10396.001.patch, YARN-10396.002.patch, 
> YARN-10396.003.patch
>
>
> Looking at the following code in 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.java#L1126}}
> {code:java}
> int maxApplications = (int) (conf.getMaximumSystemApplications()
> * childQueue.getQueueCapacities().getAbsoluteCapacity(label));
> leafQueue.setMaxApplications(maxApplications);{code}
> In Absolute Resources mode, setting the maximum number of applications at the 
> queue level gets overridden by the system-level setting scaled down to the 
> available resources. This means that the only way to set the maximum number 
> of applications is to change the queue's resource pool. This line should 
> consider the queue's 
> {{yarn.scheduler.capacity.\{queuepath}.maximum-applications}} setting.
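
For illustration, a hedged sketch of the kind of fix the description suggests 
(not the committed patch; the per-queue getter is assumed to be the existing 
CapacitySchedulerConfiguration one): prefer the queue-level maximum-applications 
when configured and fall back to the scaled system-wide maximum otherwise.
{code:java}
// Illustrative sketch only, not the committed YARN-10396 patch.
int maxApplications = conf.getMaximumApplicationsPerQueue(childQueue.getQueuePath());
if (maxApplications < 0) {
  // No queue-level maximum-applications configured: fall back to scaling the
  // system-wide maximum by the queue's absolute capacity, as before.
  maxApplications = (int) (conf.getMaximumSystemApplications()
      * childQueue.getQueueCapacities().getAbsoluteCapacity(label));
}
leafQueue.setMaxApplications(maxApplications);
{code}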


