[jira] [Commented] (YARN-10428) Zombie applications in the YARN queue using FAIR + sizebasedweight

2020-09-17 Thread Guang Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198052#comment-17198052
 ] 

Guang Yang commented on YARN-10428:
---

Hi [~wenningd]: as far as I know, mag is supposed to be non-negative unless 
there's a bug. 

 

 >>If it could, will it bring another issue? 

Could you elaborate on this? As I understand it, the behavior does not change 
when mag is negative.

 

Thanks for following up on this ticket.
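
(For readers following along: below is a generic Java sketch, purely 
illustrative and not taken from the YARN code, of how a comparison key that 
changes after insertion can leave exactly this kind of stale entry in a 
TreeSet-backed collection such as schedulableEntities. The class and field 
names are made up for the demo.)
{code:java}
import java.util.Comparator;
import java.util.TreeSet;

// Generic illustration only (not the actual FairOrderingPolicy code): if the
// value a comparator depends on changes after an element has been inserted
// into a TreeSet, remove() can walk the wrong branch of the tree and miss the
// element, leaving a stale ("zombie") entry behind.
public class StaleEntryDemo {
  static final class App {
    final String id;
    double weight; // mutated after insertion

    App(String id, double weight) {
      this.id = id;
      this.weight = weight;
    }
  }

  public static void main(String[] args) {
    Comparator<App> byWeight = Comparator.comparingDouble(a -> a.weight);
    TreeSet<App> schedulableEntities = new TreeSet<>(byWeight);

    App a2 = new App("application_2", 2.0);
    App a1 = new App("application_1", 1.0);
    schedulableEntities.add(a2); // becomes the tree root
    schedulableEntities.add(a1); // inserted as the left child (1.0 < 2.0)

    a1.weight = 3.0; // the comparison key changes in place

    // The lookup now walks to the right of the root and finds nothing.
    System.out.println(schedulableEntities.remove(a1)); // false
    System.out.println(schedulableEntities.size());     // still 2 -> zombie
  }
}
{code}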

> Zombie applications in the YARN queue using FAIR + sizebasedweight
> --
>
> Key: YARN-10428
> URL: https://issues.apache.org/jira/browse/YARN-10428
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.5
>Reporter: Guang Yang
>Priority: Major
> Attachments: YARN-10428.001.patch, YARN-10428.002.patch
>
>
> Seeing zombie jobs in a YARN queue that uses the FAIR ordering policy with 
> size-based weight.
> *Detection:*
> The YARN UI shows an incorrect number of "Num Schedulable Applications".
> *Impact:*
> The queue has an upper limit on the number of running applications; with a 
> zombie job, it hits the limit even though the number of actually running 
> applications is far below it. 
> *Workaround:*
> Fail over and restart the ResourceManager process.
> *Analysis:*
> In the heap dump, we can find the zombie jobs in the `FairOrderingPolicy#
> schedulableEntities` set (see attachment). Take application 
> "application_1599157165858_29429" for example: it is still in the 
> `FairOrderingPolicy#schedulableEntities` set; however, the ResourceManager 
> log shows that the RM already tried to remove the application:
>  
> ./yarn-yarn-resourcemanager-ip-172-21-153-252.log.2020-09-04-04:2020-09-04 
> 04:32:19,730 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue 
> (ResourceManager Event Processor): Application removed - appId: 
> application_1599157165858_29429 user: svc_di_data_eng queue: core-data 
> #user-pending-applications: -3 #user-active-applications: 7 
> #queue-pending-applications: 0 #queue-active-applications: 21
>  
> So it appears the RM failed to remove the application from the set.






[jira] [Comment Edited] (YARN-10428) Zombie applications in the YARN queue using FAIR + sizebasedweight

2020-09-17 Thread Wenning Ding (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198032#comment-17198032
 ] 

Wenning Ding edited comment on YARN-10428 at 9/18/20, 12:37 AM:


[~yguang11] I noticed that you are checking the value of mag.

I am wondering: do you know if mag could be negative? If it could, would it 
bring another issue? 


was (Author: wenningd):
I noticed that you are checking the value mag.

I am wondering do you know if mag could be negative? If it could, will it bring 
another issue? 

> Zombie applications in the YARN queue using FAIR + sizebasedweight
> --
>
> Key: YARN-10428
> URL: https://issues.apache.org/jira/browse/YARN-10428
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.5
>Reporter: Guang Yang
>Priority: Major
> Attachments: YARN-10428.001.patch, YARN-10428.002.patch
>
>
> Seeing zombie jobs in a YARN queue that uses the FAIR ordering policy with 
> size-based weight.
> *Detection:*
> The YARN UI shows an incorrect number of "Num Schedulable Applications".
> *Impact:*
> The queue has an upper limit on the number of running applications; with a 
> zombie job, it hits the limit even though the number of actually running 
> applications is far below it. 
> *Workaround:*
> Fail over and restart the ResourceManager process.
> *Analysis:*
> In the heap dump, we can find the zombie jobs in the `FairOrderingPolicy#
> schedulableEntities` set (see attachment). Take application 
> "application_1599157165858_29429" for example: it is still in the 
> `FairOrderingPolicy#schedulableEntities` set; however, the ResourceManager 
> log shows that the RM already tried to remove the application:
>  
> ./yarn-yarn-resourcemanager-ip-172-21-153-252.log.2020-09-04-04:2020-09-04 
> 04:32:19,730 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue 
> (ResourceManager Event Processor): Application removed - appId: 
> application_1599157165858_29429 user: svc_di_data_eng queue: core-data 
> #user-pending-applications: -3 #user-active-applications: 7 
> #queue-pending-applications: 0 #queue-active-applications: 21
>  
> So it appears the RM failed to remove the application from the set.






[jira] [Commented] (YARN-10428) Zombie applications in the YARN queue using FAIR + sizebasedweight

2020-09-17 Thread Wenning Ding (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198032#comment-17198032
 ] 

Wenning Ding commented on YARN-10428:
-

I noticed that you are checking the value of mag.

I am wondering: do you know if mag could be negative? If it could, would it 
bring another issue? 

> Zombie applications in the YARN queue using FAIR + sizebasedweight
> --
>
> Key: YARN-10428
> URL: https://issues.apache.org/jira/browse/YARN-10428
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.5
>Reporter: Guang Yang
>Priority: Major
> Attachments: YARN-10428.001.patch, YARN-10428.002.patch
>
>
> Seeing zombie jobs in a YARN queue that uses the FAIR ordering policy with 
> size-based weight.
> *Detection:*
> The YARN UI shows an incorrect number of "Num Schedulable Applications".
> *Impact:*
> The queue has an upper limit on the number of running applications; with a 
> zombie job, it hits the limit even though the number of actually running 
> applications is far below it. 
> *Workaround:*
> Fail over and restart the ResourceManager process.
> *Analysis:*
> In the heap dump, we can find the zombie jobs in the `FairOrderingPolicy#
> schedulableEntities` set (see attachment). Take application 
> "application_1599157165858_29429" for example: it is still in the 
> `FairOrderingPolicy#schedulableEntities` set; however, the ResourceManager 
> log shows that the RM already tried to remove the application:
>  
> ./yarn-yarn-resourcemanager-ip-172-21-153-252.log.2020-09-04-04:2020-09-04 
> 04:32:19,730 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue 
> (ResourceManager Event Processor): Application removed - appId: 
> application_1599157165858_29429 user: svc_di_data_eng queue: core-data 
> #user-pending-applications: -3 #user-active-applications: 7 
> #queue-pending-applications: 0 #queue-active-applications: 21
>  
> So it appears the RM failed to remove the application from the set.






[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197981#comment-17197981
 ] 

Jim Brennan commented on YARN-10393:


[~wzzdreamer], did you want to put up a new patch (or PR) based on the draft.2 
patch? It will need to address [~adam.antal]'s comment above, fix the existing 
tests, and maybe add another test for this case.

One other change I would make, which is not in the draft.2 patch, is to add this 
line to removeOrTrackCompletedContainersFromContext(), like you had in your 
original PR:
{noformat}
  pendingCompletedContainers.remove(containerId);
{noformat}
It may be mostly redundant with the conditional 
{{pendingCompletedContainers.clear()}} in the calling run() method, but I think 
it makes sense to do the remove here because these are containers for which we 
have affirmatively received an ack from the RM. It also "fixes" at least one 
unit test.
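
For context, here is a simplified, self-contained model of how that 
per-container remove and the conditional clear would interact. Only the names 
pendingCompletedContainers, run() and 
removeOrTrackCompletedContainersFromContext() come from the discussion above; 
everything else is an assumption, not the actual NodeStatusUpdaterImpl code.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: a simplified model of the interaction described above.
public class PendingCompletedSketch {
  private final Set<String> pendingCompletedContainers =
      ConcurrentHashMap.newKeySet();
  private boolean missedHeartbeat = false;

  // Called with the container ids the RM has acknowledged (assumed signature).
  void removeOrTrackCompletedContainersFromContext(Iterable<String> acked) {
    for (String containerId : acked) {
      // Affirmatively acked by the RM, so stop tracking it here as well
      // (the addition suggested above).
      pendingCompletedContainers.remove(containerId);
    }
  }

  // Simplified heartbeat loop body.
  void run(Iterable<String> ackedByRm) {
    removeOrTrackCompletedContainersFromContext(ackedByRm);
    if (!missedHeartbeat) {
      // Mostly redundant with the per-container remove above, but kept as the
      // coarse-grained cleanup from the draft.2 patch.
      pendingCompletedContainers.clear();
    }
    missedHeartbeat = false;
  }
}
{code}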

 

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. 
> We haven't seen it after 2.9 in our environment, but that is because of the RPC 
> retry policy change and other changes; there is still a possibility even with 
> the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the 
> resource manager; however, the node manager failed to receive the response. 
> Let's assume heartBeatResponseId=$hid in the node manager. According to our 
> current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> 

[jira] [Commented] (YARN-10396) Max applications calculation per queue disregards queue level settings in absolute mode

2020-09-17 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197915#comment-17197915
 ] 

Hadoop QA commented on YARN-10396:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 15m 
48s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
29s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
34s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
44s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 33s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
35s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
34s{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
30s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 0 new + 71 unchanged - 13 fixed = 71 total (was 84) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 26s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
37s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 91m 58s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}167m 24s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisherForV2 |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.TestZKConfigurationStore
 |
|   | 
hadoop.yarn.server.resourcemanager.metrics.TestCombinedSystemMetricsPublisher |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.TestFSSchedulerConfigurationStore
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/182/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10396 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13010481/YARN-10396.branch-3.2.001.patch
 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit 

[jira] [Commented] (YARN-9333) TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes fails intermittently

2020-09-17 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197878#comment-17197878
 ] 

Szilard Nemeth commented on YARN-9333:
--

Hi [~pbacsko], 

Thanks for working on this.

I agree with your observation made in [this 
comment|https://issues.apache.org/jira/browse/YARN-9333?focusedCommentId=17192914=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17192914].
As I see that [~adam.antal] gave a +1, I'm committing this. 

Thanks [~adam.antal] for the review. Committed to trunk and resolving this jira.

> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittently
> 
>
> Key: YARN-9333
> URL: https://issues.apache.org/jira/browse/YARN-9333
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9333-001.patch, YARN-9333-002.patch, 
> YARN-9333-003.patch, YARN-9333-debug1.patch
>
>
> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittently - observed in YARN-9311.
> {code}
> [ERROR] 
> testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes[MinSharePreemptionWithDRF](org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption)
>   Time elapsed: 11.056 s  <<< FAILURE!
> java.lang.AssertionError: Incorrect # of containers on the greedy app 
> expected:<6> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyPreemption(TestFairSchedulerPreemption.java:296)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyRelaxLocalityPreemption(TestFairSchedulerPreemption.java:537)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes(TestFairSchedulerPreemption.java:473)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> 

[jira] [Updated] (YARN-9333) TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes fails intermittently

2020-09-17 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-9333:
-
Fix Version/s: 3.4.0

> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittently
> 
>
> Key: YARN-9333
> URL: https://issues.apache.org/jira/browse/YARN-9333
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9333-001.patch, YARN-9333-002.patch, 
> YARN-9333-003.patch, YARN-9333-debug1.patch
>
>
> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittently - observed in YARN-9311.
> {code}
> [ERROR] 
> testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes[MinSharePreemptionWithDRF](org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption)
>   Time elapsed: 11.056 s  <<< FAILURE!
> java.lang.AssertionError: Incorrect # of containers on the greedy app 
> expected:<6> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyPreemption(TestFairSchedulerPreemption.java:296)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyRelaxLocalityPreemption(TestFairSchedulerPreemption.java:537)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes(TestFairSchedulerPreemption.java:473)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> 

[jira] [Updated] (YARN-9333) TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes fails intermittently

2020-09-17 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-9333:
-
Summary: 
TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
 fails intermittently  (was: 
TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
 fails intermittent)

> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittently
> 
>
> Key: YARN-9333
> URL: https://issues.apache.org/jira/browse/YARN-9333
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9333-001.patch, YARN-9333-002.patch, 
> YARN-9333-003.patch, YARN-9333-debug1.patch
>
>
> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittently - observed in YARN-9311.
> {code}
> [ERROR] 
> testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes[MinSharePreemptionWithDRF](org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption)
>   Time elapsed: 11.056 s  <<< FAILURE!
> java.lang.AssertionError: Incorrect # of containers on the greedy app 
> expected:<6> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyPreemption(TestFairSchedulerPreemption.java:296)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyRelaxLocalityPreemption(TestFairSchedulerPreemption.java:537)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes(TestFairSchedulerPreemption.java:473)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> 

[jira] [Commented] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

2020-09-17 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197870#comment-17197870
 ] 

Szilard Nemeth commented on YARN-10421:
---

Hi [~bteke],

Thanks for working on this.

Here are my comments:

1. In FederationInterceptorREST: The targetpath parameter has the same value 
(RMWSConsts.RM_WEB_SERVICE_PATH + RMWSConsts.SCHEDULE) for both 
getCommonIssueData and getCommonIssueList. Shouldn't you use 
COMMON_ISSUE_COLLECT in method getCommonIssueData?

2. Nit: There are some added newlines at the end of some files; please remove 
those.

3. Nit: In FederationInterceptorREST, I would use an exception message like 
"This feature is not yet implemented" instead of the current "Code is not 
implemented".

4. For 
org.apache.hadoop.yarn.server.router.webapp.MockRESTRequestInterceptor#getCommonIssueData,
 it isn't clear why a single OK response is returned. Can you please add a 
comment to explain this behaviour?

5. Just by looking at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices#getCommonIssueData,
 it's hard to guess what values the issueId and issueParams fields could take.
Can you please add javadoc to the method?
UPDATE: Oh, I can see this is documented in RMWebServiceProtocol.

6. Can you add API documentation of these 2 new endpoints to the file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md?

7. The javadoc for 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServiceProtocol#getCommonIssueData
 seems wrong: these are not form params but query params, as per the 
implementing code.

8. Nit: Can you use an uppercase name for 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation?

9. Nit: I would introduce an enum for the argument types (see the sketch after 
this list). You currently have it as a string constant here: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#LIST_ISSUES_ARGUMENT

10. About the script location defined by 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation: 
what makes the file available at the location "/tmp/diagnostics_collector.sh"?

11. In the warning log in method 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#parseIssueType,
 please add the expected number of parameters as well to the message. I would 
also add a return statement after the log message for clarity.

12. Can you please rename diagnostics_collector.sh so that its name somehow 
reflects that it is intended to be used by tests?

13. There's a typo in comment of method 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped:
 ambigious -> ambiguous

14. In testcase: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped:
 You may simplify the testcase dramatically by using a loop to check the issue 
id, name and parameter.

15. In testcase: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases:
 The comment says: 
// valid case: id, name, one parameter
But the test actually uses 2 parameters so this is a bit misleading.
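
To make point 9 concrete, here is a rough sketch of such an enum. It is purely 
illustrative; the enum name, constants and values are assumptions, not taken 
from the patch.
{code:java}
// Illustrative only: a possible replacement for the LIST_ISSUES_ARGUMENT
// string constant mentioned in point 9.
public enum DiagnosticsArgument {
  LIST_ISSUES("listIssues"),
  COLLECT_ISSUE("collectIssue");

  private final String scriptArgument;

  DiagnosticsArgument(String scriptArgument) {
    this.scriptArgument = scriptArgument;
  }

  public String getScriptArgument() {
    return scriptArgument;
  }
}
{code}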

> Create YarnDiagnosticsService to serve diagnostic queries 
> --
>
> Key: YARN-10421
> URL: https://issues.apache.org/jira/browse/YARN-10421
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10421.001.patch, YARN-10421.002.patch, 
> YARN-10421.003.patch
>
>
> YarnDiagnosticsServlet should run inside the ResourceManager daemon. The 
> servlet forks a separate process, which executes a shell/Python/etc. script. 
> Based on the use cases listed below, the script collects information, bundles 
> it, and sends it to UI2. The diagnostic options are the following:
>  # Application hanging: 
>  ** Application logs
>  ** Find the hanging container and get multiple Jstacks
>  ** ResourceManager logs during job lifecycle
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL
>  # Application failed: 
>  ** Application logs
>  ** ResourceManager logs during job lifecycle.
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL.
>  ** Job related metrics like container, attempts.
>  # Scheduler related issue:
>  ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
>  ** Multiple Jstacks of ResourceManager
>  ** YARN and Scheduler Configuration
>  ** Cluster Scheduler API 

[jira] [Commented] (YARN-950) Ability to limit or avoid aggregating logs beyond a certain size

2020-09-17 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197732#comment-17197732
 ] 

Adam Antal commented on YARN-950:
-

Hi [~epayne],

Sorry for the late answer. I unassigned this from myself, but I'd be happy to 
help review the patch if you have one.

For our internal use case it was enough to detect these large aggregated log 
files. That was implemented in YARN-8199; therefore I did not feel a strong push 
for this feature from our side.

> Ability to limit or avoid aggregating logs beyond a certain size
> 
>
> Key: YARN-950
> URL: https://issues.apache.org/jira/browse/YARN-950
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 0.23.9, 2.6.0
>Reporter: Jason Darrell Lowe
>Assignee: Adam Antal
>Priority: Major
>
> It would be nice if ops could configure a cluster such that any container log 
> beyond a configured size would either have only a portion of the log 
> aggregated or not be aggregated at all. This would help speed up the recovery 
> path for cases where a container creates an enormous log and fills a disk, as 
> currently it tries to aggregate the entire, enormous log rather than only 
> aggregating a small portion or simply deleting it.






[jira] [Assigned] (YARN-950) Ability to limit or avoid aggregating logs beyond a certain size

2020-09-17 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal reassigned YARN-950:
---

Assignee: (was: Adam Antal)

> Ability to limit or avoid aggregating logs beyond a certain size
> 
>
> Key: YARN-950
> URL: https://issues.apache.org/jira/browse/YARN-950
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 0.23.9, 2.6.0
>Reporter: Jason Darrell Lowe
>Priority: Major
>
> It would be nice if ops could configure a cluster such that any container log 
> beyond a configured size would either have only a portion of the log 
> aggregated or not be aggregated at all. This would help speed up the recovery 
> path for cases where a container creates an enormous log and fills a disk, as 
> currently it tries to aggregate the entire, enormous log rather than only 
> aggregating a small portion or simply deleting it.






[jira] [Commented] (YARN-10031) Create a general purpose log request with additional query parameters

2020-09-17 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197729#comment-17197729
 ] 

Adam Antal commented on YARN-10031:
---

Thanks for the response [~gandras].

I accidentally made an error in my previous comment, sorry for that. By 
simplification I meant this:
{code:java}
if (key.toString().equals(logRequest.getContainerId())) {
  ...
}
{code}
because this one statement covers the null check as well and is more concise.
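
(Small aside on why the single statement is enough: String#equals simply returns 
false for a null argument instead of throwing, so the explicit null check adds 
nothing. A tiny, self-contained illustration, with a made-up container id:)
{code:java}
public class NullCheckDemo {
  public static void main(String[] args) {
    String key = "container_1_0001_01_000001";  // made-up id for the demo
    String containerIdFromRequest = null;       // e.g. an absent filter value
    System.out.println(key.equals(containerIdFromRequest)); // prints false
  }
}
{code}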

Regarding the parameter collections, I can understand the behaviour from the 
unit tests. IMO the approach is fine, so we can go ahead with the next step.

> Create a general purpose log request with additional query parameters
> -
>
> Key: YARN-10031
> URL: https://issues.apache.org/jira/browse/YARN-10031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10031-WIP.001.patch, YARN-10031.001.patch
>
>
> The current endpoints are robust but not very flexible with regard to 
> filtering options. I suggest adding an endpoint that provides filtering 
> options.
> E.g.:
> In ATS we have multiple endpoints:
> /containers/{containerid}/logs/{filename}
> /containerlogs/{containerid}/{filename}
> We could add @QueryParams parameters to the REST endpoints like this:
> /containers/{containerid}/logs?fileName=stderr=FAILED=nm45
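
Purely as an illustration of the shape such a filtered endpoint could take, here 
is a minimal JAX-RS sketch. It is not the actual ATS/RM web service code, and 
every query parameter name other than fileName is an assumption.
{code:java}
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

// Sketch only: /containers/{containerid}/logs with optional query filters.
@Path("/containers")
public class ContainerLogsResource {

  @GET
  @Path("/{containerid}/logs")
  @Produces(MediaType.APPLICATION_JSON)
  public Response getContainerLogs(
      @PathParam("containerid") String containerId,
      @QueryParam("fileName") String fileName,     // e.g. stderr
      @QueryParam("containerState") String state,  // assumed parameter name
      @QueryParam("nodeId") String nodeId) {       // assumed parameter name
    // A null query parameter means "no filter"; real filtering is omitted.
    return Response.ok().build();
  }
}
{code}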






[jira] [Assigned] (YARN-10441) Add support for hadoop.http.rmwebapp.scheduler.page.class

2020-09-17 Thread D M Murali Krishna Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

D M Murali Krishna Reddy reassigned YARN-10441:
---

Assignee: D M Murali Krishna Reddy

> Add support for hadoop.http.rmwebapp.scheduler.page.class
> -
>
> Key: YARN-10441
> URL: https://issues.apache.org/jira/browse/YARN-10441
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
>
> In https://issues.apache.org/jira/browse/YARN-10361 the existing 
> configuration hadoop.http.rmwebapp.scheduler.page.class was renamed to 
> yarn.http.rmwebapp.scheduler.page.class, which causes an incompatibility with 
> old versions. It would be better to deprecate the old configuration instead.






[jira] [Created] (YARN-10441) Add support for hadoop.http.rmwebapp.scheduler.page.class

2020-09-17 Thread D M Murali Krishna Reddy (Jira)
D M Murali Krishna Reddy created YARN-10441:
---

 Summary: Add support for hadoop.http.rmwebapp.scheduler.page.class
 Key: YARN-10441
 URL: https://issues.apache.org/jira/browse/YARN-10441
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: D M Murali Krishna Reddy


In https://issues.apache.org/jira/browse/YARN-10361 the existing configuration 
hadoop.http.rmwebapp.scheduler.page.class was renamed to 
yarn.http.rmwebapp.scheduler.page.class, which causes an incompatibility with 
old versions. It would be better to deprecate the old configuration instead.
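
A minimal sketch of how the old key could be kept working as a deprecated alias 
via Hadoop's Configuration.addDeprecation. Where exactly this registration would 
live in YARN is left open here; only the two property names come from the 
description above.
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: register the old property as a deprecated alias of the new one
// so existing configurations keep working while emitting a deprecation warning.
public final class RmWebappSchedulerPageDeprecation {
  private RmWebappSchedulerPageDeprecation() {
  }

  public static void register() {
    Configuration.addDeprecation(
        "hadoop.http.rmwebapp.scheduler.page.class",  // old key (pre YARN-10361)
        "yarn.http.rmwebapp.scheduler.page.class");   // new key
  }
}
{code}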






[jira] [Commented] (YARN-9333) TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes fails intermittent

2020-09-17 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197718#comment-17197718
 ] 

Adam Antal commented on YARN-9333:
--

I think the potential benefits outweigh the cons, so I'm +1 on pushing the 
patch in, if there are no strong objections.

It would be nice to double-check that it's not a product issue with the 
FairScheduler, but since the issue is not reproducible locally, we should 
assume it's infra related, and this implies that pushing the patch is a good 
solution.

> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittent
> --
>
> Key: YARN-9333
> URL: https://issues.apache.org/jira/browse/YARN-9333
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9333-001.patch, YARN-9333-002.patch, 
> YARN-9333-003.patch, YARN-9333-debug1.patch
>
>
> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittently - observed in YARN-9311.
> {code}
> [ERROR] 
> testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes[MinSharePreemptionWithDRF](org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption)
>   Time elapsed: 11.056 s  <<< FAILURE!
> java.lang.AssertionError: Incorrect # of containers on the greedy app 
> expected:<6> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyPreemption(TestFairSchedulerPreemption.java:296)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyRelaxLocalityPreemption(TestFairSchedulerPreemption.java:537)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes(TestFairSchedulerPreemption.java:473)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-17 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197715#comment-17197715
 ] 

Adam Antal commented on YARN-10393:
---

Thanks for the valuable comments [~Jim_Brennan] [~yuanbo] [~wzzdreamer].

I went through the reasoning and I agree with the solution. As Jim explained, 
{{lastHeartBeatID}} is indeed not the right approach, but the conditional clear 
based on the missed heartbeat looks good to me. To be honest, I don't feel I 
could give a confident +1 to this patch, but since we have a draft patch now, 
maybe it's time to involve some senior folks to give it a +1.

Anyway, the draft looks good. One comment: add DEBUG or INFO level logging 
when {{missedHearbeat}} is true, something like this:
{code:java}
if (!missedHearbeat) {
  pendingCompletedContainers.clear();
} else {
  LOG.info("Skip clearing pending completed containers due to missed 
heartbeat.");
}
missedHearbeat = false;
{code}
Of course, you can figure this out from the logs anyway, but let's keep it 
explicit - it may also help when investigating such cases in the future.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. 
> We haven't seen it after 2.9 in our environment, but that is because of the RPC 
> retry policy change and other changes; there is still a possibility even with 
> the current code, if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the 
> resource manager; however, the node manager failed to receive the response. 
> Let's assume heartBeatResponseId=$hid in the node manager. According to our 
> current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at 

[jira] [Comment Edited] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM

2020-09-17 Thread Raghvendra Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197560#comment-17197560
 ] 

Raghvendra Singh edited comment on YARN-10438 at 9/17/20, 9:35 AM:
---

[~shubhamod] The following line of code is at line 520.

ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId();

Here is the method getContainerReport() from ClientRMService.java
{noformat}
@Override
  public GetContainerReportResponse getContainerReport(
  GetContainerReportRequest request) throws YarnException, IOException {
ContainerId containerId = request.getContainerId();
ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId();
ApplicationId appId = appAttemptId.getApplicationId();
UserGroupInformation callerUGI = getCallerUgi(appId,
AuditConstants.GET_CONTAINER_REPORT);
RMApp application = verifyUserAccessForRMApp(appId, callerUGI,
AuditConstants.GET_CONTAINER_REPORT, ApplicationAccessType.VIEW_APP,
false);
boolean allowAccess = checkAccess(callerUGI, application.getUser(),
ApplicationAccessType.VIEW_APP, application);
GetContainerReportResponse response = null;
if (allowAccess) {
  RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId);
  if (appAttempt == null) {
throw new ApplicationAttemptNotFoundException(
"ApplicationAttempt with id '" + appAttemptId +
"' doesn't exist in RM.");
  }
  RMContainer rmContainer = this.rmContext.getScheduler().getRMContainer(
  containerId);
  if (rmContainer == null) {
throw new ContainerNotFoundException("Container with id '" + containerId
+ "' doesn't exist in RM.");
  }
  response = GetContainerReportResponse.newInstance(rmContainer
  .createContainerReport());
} else {
  throw new YarnException("User " + callerUGI.getShortUserName()
  + " does not have privilege to see this application " + appId);
}
return response;
  }
{noformat}

The following is the method from ApplicationClientProtocolPBServiceImpl.java:
{noformat}
  @Override
  public GetContainerReportResponseProto getContainerReport(
  RpcController controller, GetContainerReportRequestProto proto)
  throws ServiceException {
GetContainerReportRequestPBImpl request =
new GetContainerReportRequestPBImpl(proto);
try {
  GetContainerReportResponse response = real.getContainerReport(request);
  return ((GetContainerReportResponsePBImpl) response).getProto();
} catch (YarnException e) {
  throw new ServiceException(e);
} catch (IOException e) {
  throw new ServiceException(e);
}
  }
{noformat}


was (Author: raghvendra.s):
[~shubhamod] Following is line of code is at line 520.

ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId();

Here is the method getContainerReport() from ClientRMService.java
{noformat}
  @Override
  public GetContainerReportResponse getContainerReport(
      GetContainerReportRequest request) throws YarnException, IOException {
    ContainerId containerId = request.getContainerId();
    ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId();
    ApplicationId appId = appAttemptId.getApplicationId();
    UserGroupInformation callerUGI = getCallerUgi(appId,
        AuditConstants.GET_CONTAINER_REPORT);
    RMApp application = verifyUserAccessForRMApp(appId, callerUGI,
        AuditConstants.GET_CONTAINER_REPORT, ApplicationAccessType.VIEW_APP,
        false);
    boolean allowAccess = checkAccess(callerUGI, application.getUser(),
        ApplicationAccessType.VIEW_APP, application);
    GetContainerReportResponse response = null;
    if (allowAccess) {
      RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId);
      if (appAttempt == null) {
        throw new ApplicationAttemptNotFoundException(
            "ApplicationAttempt with id '" + appAttemptId +
                "' doesn't exist in RM.");
      }
      RMContainer rmContainer = this.rmContext.getScheduler().getRMContainer(
          containerId);
      if (rmContainer == null) {
        throw new ContainerNotFoundException("Container with id '" + containerId
            + "' doesn't exist in RM.");
      }
      response = GetContainerReportResponse.newInstance(rmContainer
          .createContainerReport());
    } else {
      throw new YarnException("User " + callerUGI.getShortUserName()
          + " does not have privilege to see this application " + appId);
    }
    return response;
  }
{noformat}

The following is the method from ApplicationClientProtocolPBServiceImpl.java:
{noformat}
  @Override
  public GetContainerReportResponseProto getContainerReport(
  RpcController controller, GetContainerReportRequestProto proto)
  throws ServiceException {
GetContainerReportRequestPBImpl request =
 

[jira] [Commented] (YARN-10438) NPE while fetching container report for a node which is not there in active/decommissioned/lost/unhealthy nodes on RM

2020-09-17 Thread Raghvendra Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197560#comment-17197560
 ] 

Raghvendra Singh commented on YARN-10438:
-

[~shubhamod] The following line of code is at line 520.

ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId();

Here is the method getContainerReport() from ClientRMService.java
{noformat}
  @Override
  public GetContainerReportResponse getContainerReport(
      GetContainerReportRequest request) throws YarnException, IOException {
    ContainerId containerId = request.getContainerId();
    ApplicationAttemptId appAttemptId = containerId.getApplicationAttemptId();
    ApplicationId appId = appAttemptId.getApplicationId();
    UserGroupInformation callerUGI = getCallerUgi(appId,
        AuditConstants.GET_CONTAINER_REPORT);
    RMApp application = verifyUserAccessForRMApp(appId, callerUGI,
        AuditConstants.GET_CONTAINER_REPORT, ApplicationAccessType.VIEW_APP,
        false);
    boolean allowAccess = checkAccess(callerUGI, application.getUser(),
        ApplicationAccessType.VIEW_APP, application);
    GetContainerReportResponse response = null;
    if (allowAccess) {
      RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId);
      if (appAttempt == null) {
        throw new ApplicationAttemptNotFoundException(
            "ApplicationAttempt with id '" + appAttemptId +
                "' doesn't exist in RM.");
      }
      RMContainer rmContainer = this.rmContext.getScheduler().getRMContainer(
          containerId);
      if (rmContainer == null) {
        throw new ContainerNotFoundException("Container with id '" + containerId
            + "' doesn't exist in RM.");
      }
      response = GetContainerReportResponse.newInstance(rmContainer
          .createContainerReport());
    } else {
      throw new YarnException("User " + callerUGI.getShortUserName()
          + " does not have privilege to see this application " + appId);
    }
    return response;
  }
{noformat}

The following is the method from ApplicationClientProtocolPBServiceImpl.java:
{noformat}
  @Override
  public GetContainerReportResponseProto getContainerReport(
      RpcController controller, GetContainerReportRequestProto proto)
      throws ServiceException {
    GetContainerReportRequestPBImpl request =
        new GetContainerReportRequestPBImpl(proto);
    try {
      GetContainerReportResponse response = real.getContainerReport(request);
      return ((GetContainerReportResponsePBImpl) response).getProto();
    } catch (YarnException e) {
      throw new ServiceException(e);
    } catch (IOException e) {
      throw new ServiceException(e);
    }
  }
{noformat}

> NPE while fetching container report for a node which is not there in 
> active/decommissioned/lost/unhealthy nodes on RM
> -
>
> Key: YARN-10438
> URL: https://issues.apache.org/jira/browse/YARN-10438
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: Raghvendra Singh
>Priority: Major
>
> Here is the exception trace we are seeing; we suspect that because of this 
> exception the RM reaches a state where it no longer allows any new job to run 
> on the cluster.
> {noformat}
> 2020-09-15 07:08:15,496 WARN ipc.Server: IPC Server handler 18 on default 
> port 8032, call Call#1463486 Retry#0 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getContainerReport 
> from 10.39.91.205:49564 java.lang.NullPointerException at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:520)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
> {noformat}
> We are seeing this issue only with this specific node; we run this cluster 
> at a scale of around 500 nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-17 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10440:
-
Description: 
The RM hangs and I cannot submit any new jobs, but the RM and NM processes are 
normal. I can open x:8088/cluster/apps/RUNNING but not 
x:8088/cluster/scheduler. The apps already submitted cannot finish, and new 
apps cannot be submitted; everything hangs except the RM and NM servers 
themselves. How can I fix this? Please help!

Here is the log:
{code:java}
// code placeholder
ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
clusterResource= type=NODE_LOCAL 
requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 

[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-17 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10440:
-
Description: 
The RM hangs and I cannot submit any new jobs, but the RM and NM processes are 
normal. I can open x:8088/cluster/apps/RUNNING but not 
x:8088/cluster/scheduler. The apps already submitted cannot finish, and new 
apps cannot be submitted; everything hangs except the RM and NM servers 
themselves. How can I fix this? Please help!

Here is the log:
{code:java}
ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
clusterResource= type=NODE_LOCAL 
requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer 

[jira] [Comment Edited] (YARN-10439) Yarn Service AM listens on all IP's on the machine

2020-09-17 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197392#comment-17197392
 ] 

Bilwa S T edited comment on YARN-10439 at 9/17/20, 7:55 AM:


Thanks [~dmmkr] for the patch. I have a few comments:
1. Can you reuse YARN_SERVICE_PREFIX in YarnServiceConf.java?
2. You can reuse ServiceUtils#mandatoryEnvVariable instead of calling 
System.getenv directly.
3. The UT failures look related. Please check.
Thanks


was (Author: bilwast):
Thanks [~dmmkr] for the patch. I have a few comments:
1. Can you reuse YARN_SERVICE_PREFIX in YarnServiceConf.java?
2. The UT failures look related. Please check.
Thanks

> Yarn Service AM listens on all IP's on the machine
> --
>
> Key: YARN-10439
> URL: https://issues.apache.org/jira/browse/YARN-10439
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security, yarn-native-services
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Minor
> Attachments: YARN-10439.001.patch
>
>
> In ClientAMService.java, the RPC server is created without passing a hostname, 
> due to which the AM listens on 0.0.0.0, which is bad practice.
>  
> {{InetSocketAddress address = new InetSocketAddress(0);}}
> {{server = rpc.getServer(ClientAMProtocol.class, this, address, conf, 
> context.secretManager, 1);}}
>  
> Also, a new configuration must be added, similar to 
> "yarn.app.mapreduce.am.job.client.port-range", so that the client can configure 
> the port range for the Yarn service AM to bind to.
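
For reference, here is a minimal sketch of the kind of change the description asks for: binding 
the client AM server to the host the AM container runs on instead of the wildcard address. The 
NM_HOST environment lookup and the ephemeral port 0 are assumptions for illustration, not the 
contents of YARN-10439.001.patch; per the review comment above, the env lookup could also go 
through ServiceUtils#mandatoryEnvVariable.
{code:java}
// Hypothetical sketch inside ClientAMService: bind to the container host
// reported by the NodeManager instead of 0.0.0.0. Port 0 keeps an ephemeral
// port; a configurable port range would replace it per the description.
String bindHost =
    System.getenv(ApplicationConstants.Environment.NM_HOST.name());
InetSocketAddress address = (bindHost != null)
    ? new InetSocketAddress(bindHost, 0)
    : new InetSocketAddress(0);
server = rpc.getServer(ClientAMProtocol.class, this, address, conf,
    context.secretManager, 1);
{code}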



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-17 Thread jufeng li (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197472#comment-17197472
 ] 

jufeng li commented on YARN-10440:
--

I restarted the RM and it recovered, and the information is gone. But I did dump 
the heap last time.

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes are 
> normal. I can open x:8088/cluster/apps/RUNNING but not 
> x:8088/cluster/scheduler. The apps already submitted cannot finish, and new 
> apps cannot be submitted; everything hangs except the RM and NM servers 
> themselves. How can I fix this? Please help!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-17 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197469#comment-17197469
 ] 

Surendra Singh Lilhore commented on YARN-10440:
---

Hi [~Jufeng],

If you still have the cluster, can you attach the output of "*jstack  >dump*"?
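
Running jstack against the RM pid is the usual way to capture this. As a generic fallback (plain 
JDK, nothing YARN-specific), a JVM can also dump its own threads via ThreadMXBean; the class name 
below is made up for illustration.
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Generic JDK sketch: print a thread dump of the current JVM to stdout. */
public class ThreadDumpSketch {
  public static void main(String[] args) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // Request locked monitors and synchronizers so lock owners show up,
    // which is what matters when hunting a scheduler hang.
    for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
      System.out.println("\"" + info.getThreadName() + "\" state="
          + info.getThreadState());
      for (StackTraceElement frame : info.getStackTrace()) {
        System.out.println("    at " + frame);
      }
      System.out.println();
    }
  }
}
{code}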

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes are 
> normal. I can open x:8088/cluster/apps/RUNNING but not 
> x:8088/cluster/scheduler. The apps already submitted cannot finish, and new 
> apps cannot be submitted; everything hangs except the RM and NM servers 
> themselves. How can I fix this? Please help!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-17 Thread jufeng li (Jira)
jufeng li created YARN-10440:


 Summary: resource manager hangs,and i cannot submit any new 
jobs,but rm and nm processes are normal
 Key: YARN-10440
 URL: https://issues.apache.org/jira/browse/YARN-10440
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.1.1
Reporter: jufeng li


The RM hangs and I cannot submit any new jobs, but the RM and NM processes are 
normal. I can open x:8088/cluster/apps/RUNNING but not 
x:8088/cluster/scheduler. The apps already submitted cannot finish, and new 
apps cannot be submitted; everything hangs except the RM and NM servers 
themselves. How can I fix this? Please help!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org