[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats

2020-07-23 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163873#comment-17163873
 ] 

Hadoop QA commented on YARN-10208:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m 
19s{color} | {color:red} Docker failed to build yetus/hadoop:cce5a6f6094. 
{color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10208 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12998554/YARN-10208.005.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/26305/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |


This message was automatically generated.



> Add metric in CapacityScheduler for evaluating the time difference between 
> node heartbeats
> --
>
> Key: YARN-10208
> URL: https://issues.apache.org/jira/browse/YARN-10208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Pranjal Protim Borah
>Assignee: Pranjal Protim Borah
>Priority: Minor
> Attachments: YARN-10208.001.patch, YARN-10208.002.patch, 
> YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch
>
>
> A metric measuring the average time interval between node heartbeats in the 
> capacity scheduler, sampled on each node update event.
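
A minimal sketch of such a metric (class and field names like NodeHeartbeatIntervalMetric and lastHeartbeatTime are hypothetical, not taken from the attached patches): record the gap between consecutive node updates per node and fold it into a rate metric.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRate;
import org.apache.hadoop.util.Time;
import org.apache.hadoop.yarn.api.records.NodeId;

class NodeHeartbeatIntervalMetric {
  private final MetricsRegistry registry = new MetricsRegistry("CapacityScheduler");
  // Rate metric: add() feeds samples; the metric exposes count and average.
  private final MutableRate nodeHeartbeatInterval =
      registry.newRate("NodeHeartbeatInterval", "Interval between node heartbeats (ms)");
  private final Map<NodeId, Long> lastHeartbeatTime = new ConcurrentHashMap<>();

  // Invoked on every NODE_UPDATE event for the reporting node.
  void recordHeartbeat(NodeId nodeId) {
    long now = Time.monotonicNow();
    Long last = lastHeartbeatTime.put(nodeId, now);
    if (last != null) {
      nodeHeartbeatInterval.add(now - last); // gap since the previous heartbeat
    }
  }
}
{code}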



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10320) Replace FSDataInputStream#read with readFully in Log Aggregation

2020-07-23 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163817#comment-17163817
 ] 

Bilwa S T commented on YARN-10320:
--

Thanks for the patch [~tanu.ajmera].

I think you need to replace read with readFully in the code below too:
{code:java}
 while ((len = in.read(buf)) != -1) {
  //If buffer contents within fileLength, write
  if (len < bytesLeft) {
outputStreamState.getOutputStream().write(buf, 0, len);
bytesLeft-=len;
  } else {
//else only write contents within fileLength, then exit early
outputStreamState.getOutputStream().write(buf, 0,
(int)bytesLeft);
break;
  }
}
{code}
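
A minimal sketch of a readFully-based version of that loop (an illustration assuming fileLength bytes remain to be copied and reusing in and outputStreamState from the snippet above; not the actual patch): reading exact chunk sizes derived from the remaining count means a short read can never silently truncate the copy.

{code:java}
// Sketch only: copy exactly fileLength bytes. readFully throws EOFException
// if the stream ends early, instead of silently returning fewer bytes.
byte[] buf = new byte[65536];
long bytesLeft = fileLength;
while (bytesLeft > 0) {
  int toRead = (int) Math.min(buf.length, bytesLeft);
  in.readFully(buf, 0, toRead);
  outputStreamState.getOutputStream().write(buf, 0, toRead);
  bytesLeft -= toRead;
}
{code}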
 

> Replace FSDataInputStream#read with readFully in Log Aggregation
> 
>
> Key: YARN-10320
> URL: https://issues.apache.org/jira/browse/YARN-10320
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10320-001.patch, YARN-10320-002.patch
>
>
> We have observed that the Log Aggregation code uses FSDataInputStream#read 
> instead of readFully in multiple places, like the one below. One of these 
> places was fixed by YARN-8106.
> This Jira targets fixing all the other places.
> LogAggregationIndexedFileController#loadUUIDFromLogFile
> {code}
>   byte[] b = new byte[uuid.length];
>   int actual = fsDataInputStream.read(b);
>   if (actual != uuid.length || Arrays.equals(b, uuid)) {
> {code}
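
A hedged sketch of the readFully variant of that snippet (the surrounding control flow and the uuidMatches local are assumptions, not the actual code; assumes java.io.EOFException and java.util.Arrays are imported):

{code:java}
// Sketch only: readFully either fills b completely or throws EOFException,
// so there is no short-read case to compare lengths for.
byte[] b = new byte[uuid.length];
boolean uuidMatches;
try {
  fsDataInputStream.readFully(b);
  uuidMatches = Arrays.equals(b, uuid);
} catch (EOFException e) {
  uuidMatches = false; // stream shorter than the UUID header
}
{code}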



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10359) Log container report only if list is not empty

2020-07-23 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163784#comment-17163784
 ] 

Hadoop QA commented on YARN-10359:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m 
18s{color} | {color:red} Docker failed to build yetus/hadoop:cce5a6f6094. 
{color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10359 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13008293/YARN-10359.001.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/26304/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |


This message was automatically generated.



> Log container report only if list is not empty
> --
>
> Key: YARN-10359
> URL: https://issues.apache.org/jira/browse/YARN-10359
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10359.001.patch
>
>
> In NodeStatusUpdaterImpl, print the log only if the containerReports list is not empty:
> {code:java}
> if (containerReports != null) {
> LOG.info("Registering with RM using containers :" + containerReports);
>  }
> {code}
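
A minimal sketch of the proposed guard (assuming the existing LOG and that containerReports is a java.util.List; this mirrors the description, not necessarily the attached patch):

{code:java}
// Log the container reports only when there is something to report.
if (containerReports != null && !containerReports.isEmpty()) {
  LOG.info("Registering with RM using containers :" + containerReports);
}
{code}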



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10359) Log container report only if list is not empty

2020-07-23 Thread Bilwa S T (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bilwa S T updated YARN-10359:
-
Attachment: YARN-10359.001.patch

> Log container report only if list is not empty
> --
>
> Key: YARN-10359
> URL: https://issues.apache.org/jira/browse/YARN-10359
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-10359.001.patch
>
>
> In NodeStatusUpdaterImpl, print the log only if the containerReports list is not empty:
> {code:java}
> if (containerReports != null) {
> LOG.info("Registering with RM using containers :" + containerReports);
>  }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review

2020-07-23 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163766#comment-17163766
 ] 

Szilard Nemeth edited comment on YARN-10278 at 7/23/20, 5:00 PM:
-

Hi [~epayne],
Can you please look at the above 
[comment|https://issues.apache.org/jira/browse/YARN-10278?focusedCommentId=17161040&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17161040]?
 

Thanks.


was (Author: snemeth):
Hi [~epayne],
Can you please look at the above comment? 

Thanks.

> CapacityScheduler test framework 
> ProportionalCapacityPreemptionPolicyMockFramework need some review
> ---
>
> Key: YARN-10278
> URL: https://issues.apache.org/jira/browse/YARN-10278
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10278.001.patch, YARN-10278.002.patch, 
> YARN-10278.branch-3.1.001.patch, YARN-10278.branch-3.1.002.patch, 
> YARN-10278.branch-3.1.003.patch, YARN-10278.branch-3.2.001.patch, 
> YARN-10278.branch-3.2.002.patch, YARN-10278.branch-3.3.001.patch
>
>
> This test framework class mocks a bit too heavily and simulates CS internal 
> behaviour with the mock methods beyond the point where it is reasonably 
> maintainable; any internal change in CS is a major headscratcher.
> A lot of tests depend on this class, so we should approach it carefully, but 
> I think it's worth examining whether this class can be made a bit more 
> resilient to changes and easier to maintain, or at least documented better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review

2020-07-23 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163766#comment-17163766
 ] 

Szilard Nemeth commented on YARN-10278:
---

Hi [~epayne],
Can you please look at the above comment? 

Thanks.

> CapacityScheduler test framework 
> ProportionalCapacityPreemptionPolicyMockFramework need some review
> ---
>
> Key: YARN-10278
> URL: https://issues.apache.org/jira/browse/YARN-10278
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10278.001.patch, YARN-10278.002.patch, 
> YARN-10278.branch-3.1.001.patch, YARN-10278.branch-3.1.002.patch, 
> YARN-10278.branch-3.1.003.patch, YARN-10278.branch-3.2.001.patch, 
> YARN-10278.branch-3.2.002.patch, YARN-10278.branch-3.3.001.patch
>
>
> This test framework class mocks a bit too heavily and simulates CS internal 
> behaviour with the mock methods beyond the point where it is reasonably 
> maintainable; any internal change in CS is a major headscratcher.
> A lot of tests depend on this class, so we should approach it carefully, but 
> I think it's worth examining whether this class can be made a bit more 
> resilient to changes and easier to maintain, or at least documented better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2020-07-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163705#comment-17163705
 ] 

Jim Brennan commented on YARN-4771:
---

I think this is ready for review.
cc: [~epayne], [~ebadger]

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Jason Darrell Lowe
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch, 
> YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-07-23 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163600#comment-17163600
 ] 

Adam Antal commented on YARN-10304:
---

+1 from me. Thanks for working on this [~gandras].

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch, YARN-10304.002.patch, 
> YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, 
> YARN-10304.006.patch, YARN-10304.007.patch
>
>
> The logic for determining the aggregated log directory path (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. Providing a separate class that creates the path for a specific user 
> gives us an abstraction over this logic. It could be used in place of the 
> previously duplicated logic; moreover, we could provide an endpoint to query 
> this path.
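
A minimal sketch of what such an abstraction could look like (the class name RemoteLogPathResolver is hypothetical; the configuration keys are the standard YarnConfiguration ones):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical helper: one place that derives the per-user remote
// aggregated-log directory from configuration.
public final class RemoteLogPathResolver {
  private final Configuration conf;

  public RemoteLogPathResolver(Configuration conf) {
    this.conf = conf;
  }

  // {remote-root}/{user}/{suffix}, e.g. /tmp/logs/alice/logs
  public Path getRemoteAppLogDir(String user) {
    Path root = new Path(conf.get(YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
    String suffix = conf.get(YarnConfiguration.NM_REMOTE_APP_LOG_DIR_SUFFIX,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR_SUFFIX);
    return new Path(new Path(root, user), suffix);
  }
}
{code}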



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same

2020-07-23 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163460#comment-17163460
 ] 

Hudson commented on YARN-10315:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18467 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18467/])
YARN-10315. Avoid sending RMNodeResourceupdate event if resource is same 
(bibinchundatt: rev bfcd775381f1e0b94b17ce3cfca7eade95df1ea8)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java


> Avoid sending RMNodeResourceupdate event if resource is same
> 
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10315.001.patch, YARN-10315.002.patch
>
>
> When the node is in DECOMMISSIONING state, an RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending the same event repeatedly.
> The scheduler node resource update iterates through all the queues, which is 
> costly.
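
A minimal sketch of the proposed guard (the event plumbing follows common RM patterns; rmNode, rmContext, and the overcommit timeout are assumptions, not lines from the attached patch):

{code:java}
// Sketch only: fire RMNodeResourceUpdateEvent only when the reported
// resource actually differs from what the RM already tracks.
if (!rmNode.getTotalCapability().equals(newResource)) {
  rmContext.getDispatcher().getEventHandler().handle(
      new RMNodeResourceUpdateEvent(rmNode.getNodeID(),
          ResourceOption.newInstance(newResource,
              RMNode.OVER_COMMIT_TIMEOUT_MILLIS_DEFAULT)));
}
{code}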



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state

2020-07-23 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163446#comment-17163446
 ] 

Adam Antal commented on YARN-10332:
---

I'm sorry, I missed this too. Nice catch [~yehuanhuan]. +1 for this change.

> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
> 
>
> Key: YARN-10332
> URL: https://issues.apache.org/jira/browse/YARN-10332
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: yehuanhuan
>Priority: Minor
> Attachments: YARN-10332.001.patch
>
>
> RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5302) Yarn Application log Aggregation fails due to NM can not get correct HDFS delegation token II

2020-07-23 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-5302:
-
Summary: Yarn Application log Aggregation fails due to NM can not get 
correct HDFS delegation token II  (was: Yarn Application log Aggreagation fails 
due to NM can not get correct HDFS delegation token II)

> Yarn Application log Aggregation fails due to NM can not get correct HDFS 
> delegation token II
> -
>
> Key: YARN-5302
> URL: https://issues.apache.org/jira/browse/YARN-5302
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
>Priority: Major
>  Labels: oct16-medium
> Attachments: YARN-5032.001.patch, YARN-5032.002.patch, 
> YARN-5302.003.patch, YARN-5302.004.patch
>
>
> Different from YARN-5098, this happens on the NM side. When the NM recovers, 
> credentials are read from the NMStateStore. When initializing the app 
> aggregators, an exception occurs because of the expired tokens. The app is a 
> long-running service.
> {code:title=LogAggregationService.java}
>   protected void initAppAggregator(final ApplicationId appId, String user,
>   Credentials credentials, ContainerLogsRetentionPolicy 
> logRetentionPolicy,
>   Map<ApplicationAccessType, String> appAcls,
>   LogAggregationContext logAggregationContext) {
> // Get user's FileSystem credentials
> final UserGroupInformation userUgi =
> UserGroupInformation.createRemoteUser(user);
> if (credentials != null) {
>   userUgi.addCredentials(credentials);
> }
>...
> try {
>   // Create the app dir
>   createAppDir(user, appId, userUgi);
> } catch (Exception e) {
>   appLogAggregator.disableLogAggregation();
>   if (!(e instanceof YarnRuntimeException)) {
> appDirException = new YarnRuntimeException(e);
>   } else {
> appDirException = (YarnRuntimeException)e;
>   }
>   appLogAggregators.remove(appId);
>   closeFileSystems(userUgi);
>   throw appDirException;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same

2020-07-23 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163423#comment-17163423
 ] 

Adam Antal commented on YARN-10315:
---

+1 from me on v2. Thanks for the patch [~Sushil-K-S].

> Avoid sending RMNodeResourceupdate event if resource is same
> 
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch, YARN-10315.002.patch
>
>
> When the node is in DECOMMISSIONING state, an RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending the same event repeatedly.
> The scheduler node resource update iterates through all the queues, which is 
> costly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-23 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163418#comment-17163418
 ] 

Adam Antal commented on YARN-10319:
---

Indeed, the test failure is not related. 
+1 from me, thanks for the effort [~prabhujoseph].

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch
>
>
> ActivitiesManager records the call flow for a given nodeId, or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help debug the issue offline.
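
A hedged usage illustration of fetching the last N activities over REST (the endpoint path and the activitiesCount parameter are assumptions based on this patch series, not confirmed in this thread):

{code}
curl "http://<rm-host>:8088/ws/v1/cluster/scheduler/bulk-activities?activitiesCount=10"
{code}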



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-23 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163411#comment-17163411
 ] 

Prabhu Joseph commented on YARN-10319:
--

[~adam.antal] The failed testcase is not related; let me know if the latest 
patch [^YARN-10319-006.patch] is fine. Thanks.

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch
>
>
> ActivitiesManager records the call flow for a given nodeId, or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5098) Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token

2020-07-23 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-5098:
-
Summary: Yarn Application Log Aggregation fails due to NM can not get 
correct HDFS delegation token  (was: Yarn Application log Aggreagation fails 
due to NM can not get correct HDFS delegation token)

> Yarn Application Log Aggregation fails due to NM can not get correct HDFS 
> delegation token
> --
>
> Key: YARN-5098
> URL: https://issues.apache.org/jira/browse/YARN-5098
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Yesha Vora
>Assignee: Jian He
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: YARN-5098.1.patch, YARN-5098.1.patch, YARN-5098.2.patch, 
> YARN-5098.3.patch
>
>
> Environment : HA cluster
> Yarn application logs for a long-running application could not be gathered 
> because the NodeManager failed to talk to HDFS with the error below.
> {code}
> 2016-05-16 18:18:28,533 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:finishLogAggregation(555)) - Application just 
> finished : application_1463170334122_0002
> 2016-05-16 18:18:28,545 WARN  ipc.Client (Client.java:run(705)) - Exception 
> encountered while connecting to the server :
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 171 for hrt_qa) can't be found in cache
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:375)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:583)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:398)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:752)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:748)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1719)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:747)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3100(Client.java:398)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1597)
> at org.apache.hadoop.ipc.Client.call(Client.java:1439)
> at org.apache.hadoop.ipc.Client.call(Client.java:1386)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:240)
> at com.sun.proxy.$Proxy83.getServerDefaults(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:282)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> at com.sun.proxy.$Proxy84.getServerDefaults(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:1018)
> at org.apache.hadoop.fs.Hdfs.getServerDefaults(Hdfs.java:156)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:550)
> at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:687)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-07-23 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10364:


 Summary: Absolute Resource [memory=0] is considered as Percentage 
config type
 Key: YARN-10364
 URL: https://issues.apache.org/jira/browse/YARN-10364
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Absolute Resource [memory=0] is treated as a Percentage config type. This 
causes a failure while converting queues from Percentage to Absolute Resources 
automatically.

*Repro:*

1. Queue A = 100%, with child queues A.B = 0% and A.C = 100%

2. While converting the above to absolute resources automatically, the 
capacity of queue A = [memory=], A.B = [memory=0]

This fails with the error below, as A is treated as an Absolute Resource 
config type whereas B is treated as a Percentage config type.

{code}
2020-07-23 09:36:40,499 WARN 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
CapacityScheduler configuration validation failed:java.io.IOException: Failed 
to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should use 
either percentage based capacityconfiguration or absolute resource together for 
label:
{code}
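
A hedged illustration of the two capacity syntaxes in play (property names follow the standard capacity-scheduler.xml convention; the non-zero memory value is made up for the example): a bracketed value is meant to be parsed as an absolute resource, a bare number as a percentage, but [memory=0] falls through to the percentage path.

{code}
yarn.scheduler.capacity.root.A.capacity = [memory=12288]  # absolute resource
yarn.scheduler.capacity.root.A.B.capacity = [memory=0]    # absolute, but mis-detected as percentage
yarn.scheduler.capacity.root.A.C.capacity = 100           # percentage
{code}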





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5305) Yarn Application log Aggregation fails due to NM can not get correct HDFS delegation token III

2020-07-23 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-5305:
-
Summary: Yarn Application log Aggregation fails due to NM can not get 
correct HDFS delegation token III  (was: Yarn Application log Aggreagation 
fails due to NM can not get correct HDFS delegation token III)

> Yarn Application log Aggregation fails due to NM can not get correct HDFS 
> delegation token III
> --
>
> Key: YARN-5305
> URL: https://issues.apache.org/jira/browse/YARN-5305
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Xianyin Xin
>Priority: Major
>
> Different from YARN-5098 and YARN-5302, this problem happens when the AM 
> submits a startContainer request with a new HDFS token (say, tokenB) which is 
> not managed by YARN, so two tokens exist in the user's credentials on the NM: 
> one is tokenB, the other is the one renewed on the RM (tokenA). If tokenB is 
> selected when connecting to HDFS and tokenB has expired, an exception occurs.
> Supplementary: this problem happens because the AM didn't use the service 
> name as the token alias in the credentials, so two tokens for the same 
> service can co-exist in one credentials object. The TokenSelector can only 
> select the first matched token; it doesn't check whether the token is valid.
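
A minimal sketch of the remedy the supplementary note implies (the helper addByService is hypothetical; Credentials#addToken and Token#getService are real Hadoop APIs): keying a token by its service name lets a newer token for the same service replace the stale one instead of co-existing with it.

{code:java}
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

final class TokenAliasing {
  // Alias the token by its service so at most one token per service survives.
  static void addByService(Credentials credentials,
      Token<? extends TokenIdentifier> token) {
    credentials.addToken(token.getService(), token);
  }
}
{code}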



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5305) Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token III

2020-07-23 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-5305:
-
Summary: Yarn Application Log Aggregation fails due to NM can not get 
correct HDFS delegation token III  (was: Yarn Application log Aggregation fails 
due to NM can not get correct HDFS delegation token III)

> Yarn Application Log Aggregation fails due to NM can not get correct HDFS 
> delegation token III
> --
>
> Key: YARN-5305
> URL: https://issues.apache.org/jira/browse/YARN-5305
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Xianyin Xin
>Priority: Major
>
> Different from YARN-5098 and YARN-5302, this problem happens when the AM 
> submits a startContainer request with a new HDFS token (say, tokenB) which is 
> not managed by YARN, so two tokens exist in the user's credentials on the NM: 
> one is tokenB, the other is the one renewed on the RM (tokenA). If tokenB is 
> selected when connecting to HDFS and tokenB has expired, an exception occurs.
> Supplementary: this problem happens because the AM didn't use the service 
> name as the token alias in the credentials, so two tokens for the same 
> service can co-exist in one credentials object. The TokenSelector can only 
> select the first matched token; it doesn't check whether the token is valid.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-07-23 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10364:
-
Affects Version/s: 3.4.0

> Absolute Resource [memory=0] is considered as Percentage config type
> 
>
> Key: YARN-10364
> URL: https://issues.apache.org/jira/browse/YARN-10364
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> Absolute Resource [memory=0] is treated as a Percentage config type. This 
> causes a failure while converting queues from Percentage to Absolute 
> Resources automatically.
> *Repro:*
> 1. Queue A = 100%, with child queues A.B = 0% and A.C = 100%
> 2. While converting the above to absolute resources automatically, the 
> capacity of queue A = [memory=], A.B = [memory=0]
> This fails with the error below, as A is treated as an Absolute Resource 
> config type whereas B is treated as a Percentage config type.
> {code}
> 2020-07-23 09:36:40,499 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Failed 
> to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should 
> use either percentage based capacityconfiguration or absolute resource 
> together for label:
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-23 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163335#comment-17163335
 ] 

Prabhu Joseph commented on YARN-10352:
--

[~wangda] The failing testcase is not related to this fix. Let me know if the 
latest patch [^YARN-10352-006.patch] is fine; I will commit it if there are no 
comments. Thanks.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the 
> RM, so the RM's active nodes will still include those stopped nodes until the 
> NM Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement assigns containers to those nodes. It needs to 
> exclude nodes which have not heartbeated within the configured heartbeat 
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), 
> similar to the Asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule; a sketch follows the repro steps 
> below).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.
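
A minimal sketch of the staleness check referenced in the description (CapacityScheduler#shouldSkipNodeSchedule is real; this body is an approximation assuming org.apache.hadoop.util.Time and SchedulerNode#getLastHeartbeatMonotonicTime, not the attached patch):

{code:java}
// Sketch only: skip nodes that have missed roughly two heartbeat intervals,
// since they may be dead and containers placed on them would time out.
private boolean shouldSkipNodeSchedule(FiCaSchedulerNode node,
    long nmHeartbeatIntervalMs) {
  long sinceLastHeartbeat =
      Time.monotonicNow() - node.getLastHeartbeatMonotonicTime();
  return sinceLastHeartbeat > nmHeartbeatIntervalMs * 2;
}
{code}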



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same

2020-07-23 Thread Bibin Chundatt (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163246#comment-17163246
 ] 

Bibin Chundatt commented on YARN-10315:
---

+1, looks good to me.

[~adam.antal] I will wait for a few days before committing.

> Avoid sending RMNodeResourceupdate event if resource is same
> 
>
> Key: YARN-10315
> URL: https://issues.apache.org/jira/browse/YARN-10315
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-10315.001.patch, YARN-10315.002.patch
>
>
> When the node is in DECOMMISSIONING state, an RMNodeResourceUpdateEvent is 
> sent for every heartbeat, which results in a scheduler resource update.
> Avoid sending the same event repeatedly.
> The scheduler node resource update iterates through all the queues, which is 
> costly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-3001) RM dies because of divide by zero

2020-07-23 Thread xin qian (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xin qian updated YARN-3001:
---
Comment: was deleted

(was: Hi, Zhihai Xu , do you know how to reproduce this problem?)

> RM dies because of divide by zero
> -
>
> Key: YARN-3001
> URL: https://issues.apache.org/jira/browse/YARN-3001
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.5.1
>Reporter: hoelog
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-3001.barnch-2.7.patch
>
>
> RM dies because of divide by zero exception.
> {code}
> 2014-12-31 21:27:05,022 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator.computeAvailableContainers(DefaultResourceCalculator.java:37)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1332)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1218)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:877)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:656)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:570)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:851)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:900)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599)
> at java.lang.Thread.run(Thread.java:745)
> 2014-12-31 21:27:05,023 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
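
The crash originates in DefaultResourceCalculator#computeAvailableContainers, which divides available memory by the required memory. A hedged sketch of the failure mode and a defensive guard (an illustration, not necessarily the attached patch):

{code:java}
// Sketch only: a container request with zero memory divides by zero above.
public int computeAvailableContainers(Resource available, Resource required) {
  if (required.getMemory() <= 0) {
    return Integer.MAX_VALUE; // memory does not constrain the container count
  }
  return available.getMemory() / required.getMemory();
}
{code}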



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org