[jira] [Commented] (YARN-10208) Add metric in CapacityScheduler for evaluating the time difference between node heartbeats
[ https://issues.apache.org/jira/browse/YARN-10208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163873#comment-17163873 ] Hadoop QA commented on YARN-10208: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} docker {color} | {color:red} 0m 19s{color} | {color:red} Docker failed to build yetus/hadoop:cce5a6f6094. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-10208 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12998554/YARN-10208.005.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/26305/console | | versions | git=2.17.1 | | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org | This message was automatically generated. > Add metric in CapacityScheduler for evaluating the time difference between > node heartbeats > -- > > Key: YARN-10208 > URL: https://issues.apache.org/jira/browse/YARN-10208 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Pranjal Protim Borah >Assignee: Pranjal Protim Borah >Priority: Minor > Attachments: YARN-10208.001.patch, YARN-10208.002.patch, > YARN-10208.003.patch, YARN-10208.004.patch, YARN-10208.005.patch > > > Metric measuring average time interval between node heartbeats in capacity > scheduler on node update event. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
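For context, a rough sketch of what tracking such an average heartbeat interval could look like; the class and method names below are illustrative only and are not taken from the attached patches.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: maintain a running average of the time between
// consecutive node heartbeats, updated on every node update event.
public class HeartbeatIntervalTracker {
  private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();
  private long totalIntervalMillis;
  private long intervalCount;

  // Called once per node update event with the reporting node's id.
  public synchronized void recordHeartbeat(String nodeId) {
    long now = System.currentTimeMillis();
    Long previous = lastHeartbeatMillis.put(nodeId, now);
    if (previous != null) {
      totalIntervalMillis += now - previous;
      intervalCount++;
    }
  }

  // Average interval between consecutive heartbeats across all nodes so far.
  public synchronized long getAverageIntervalMillis() {
    return intervalCount == 0 ? 0 : totalIntervalMillis / intervalCount;
  }
}
{code}

In a real scheduler metric this value would typically be exposed through the Hadoop metrics system rather than a plain getter.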
[jira] [Commented] (YARN-10320) Replace FSDataInputStream#read with readFully in Log Aggregation
[ https://issues.apache.org/jira/browse/YARN-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163817#comment-17163817 ] Bilwa S T commented on YARN-10320: -- Thanks for the patch [~tanu.ajmera]. I think you need to replace read with readFully in the code below as well: {code:java} while ((len = in.read(buf)) != -1) { //If buffer contents within fileLength, write if (len < bytesLeft) { outputStreamState.getOutputStream().write(buf, 0, len); bytesLeft-=len; } else { //else only write contents within fileLength, then exit early outputStreamState.getOutputStream().write(buf, 0, (int)bytesLeft); break; } } {code} > Replace FSDataInputStream#read with readFully in Log Aggregation > > > Key: YARN-10320 > URL: https://issues.apache.org/jira/browse/YARN-10320 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Tanu Ajmera >Priority: Major > Attachments: YARN-10320-001.patch, YARN-10320-002.patch > > > Have observed that the Log Aggregation code uses FSDataInputStream#read instead of > readFully in multiple places like the one below. One of the places was fixed by > YARN-8106. > This Jira targets fixing all the other places. > LogAggregationIndexedFileController#loadUUIDFromLogFile > {code} > byte[] b = new byte[uuid.length]; > int actual = fsDataInputStream.read(b); > if (actual != uuid.length || Arrays.equals(b, uuid)) { > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
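To illustrate the difference being discussed: DataInputStream#read may return fewer bytes than the buffer size even before end-of-file, while readFully either fills the buffer completely or throws EOFException. FSDataInputStream extends DataInputStream, so the same distinction applies to the log aggregation code quoted above. A minimal sketch of the pattern (not the actual patch; the helper name is made up):

{code:java}
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper showing the read -> readFully pattern.
public final class ReadFullyExample {
  static void verifyUuid(DataInputStream in, byte[] uuid) throws IOException {
    byte[] b = new byte[uuid.length];
    // readFully fills the whole buffer or throws EOFException, so a short
    // read can no longer be mistaken for a UUID mismatch.
    in.readFully(b);
    if (!Arrays.equals(b, uuid)) {
      throw new IOException("UUID mismatch in aggregated log file header");
    }
  }
}
{code}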
[jira] [Commented] (YARN-10359) Log container report only if list is not empty
[ https://issues.apache.org/jira/browse/YARN-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163784#comment-17163784 ] Hadoop QA commented on YARN-10359: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} docker {color} | {color:red} 0m 18s{color} | {color:red} Docker failed to build yetus/hadoop:cce5a6f6094. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-10359 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13008293/YARN-10359.001.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/26304/console | | versions | git=2.17.1 | | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org | This message was automatically generated. > Log container report only if list is not empty > -- > > Key: YARN-10359 > URL: https://issues.apache.org/jira/browse/YARN-10359 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-10359.001.patch > > > In NodeStatusUpdaterImpl print log only if containerReports list is not empty > {code:java} > if (containerReports != null) { > LOG.info("Registering with RM using containers :" + containerReports); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
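A minimal sketch of the guard described above, assuming the surrounding NodeStatusUpdaterImpl context (the actual patch may differ):

{code:java}
// Hypothetical sketch: log only when the list actually has entries,
// instead of only checking for null.
if (containerReports != null && !containerReports.isEmpty()) {
  LOG.info("Registering with RM using containers :" + containerReports);
}
{code}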
[jira] [Updated] (YARN-10359) Log container report only if list is not empty
[ https://issues.apache.org/jira/browse/YARN-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-10359: - Attachment: YARN-10359.001.patch > Log container report only if list is not empty > -- > > Key: YARN-10359 > URL: https://issues.apache.org/jira/browse/YARN-10359 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-10359.001.patch > > > In NodeStatusUpdaterImpl print log only if containerReports list is not empty > {code:java} > if (containerReports != null) { > LOG.info("Registering with RM using containers :" + containerReports); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review
[ https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163766#comment-17163766 ] Szilard Nemeth edited comment on YARN-10278 at 7/23/20, 5:00 PM: - Hi [~epayne], Can you please look at the above [comment|https://issues.apache.org/jira/browse/YARN-10278?focusedCommentId=17161040=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17161040]? Thanks. was (Author: snemeth): Hi [~epayne], Can you please look at the above comment? Thanks. > CapacityScheduler test framework > ProportionalCapacityPreemptionPolicyMockFramework need some review > --- > > Key: YARN-10278 > URL: https://issues.apache.org/jira/browse/YARN-10278 > Project: Hadoop YARN > Issue Type: Task >Reporter: Gergely Pollak >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10278.001.patch, YARN-10278.002.patch, > YARN-10278.branch-3.1.001.patch, YARN-10278.branch-3.1.002.patch, > YARN-10278.branch-3.1.003.patch, YARN-10278.branch-3.2.001.patch, > YARN-10278.branch-3.2.002.patch, YARN-10278.branch-3.3.001.patch > > > This test framework class mocks a bit too heavily and simulates CS internal > behaviour with the mock methods beyond the point where it is reasonably > maintainable; any internal change in CS becomes a major headache. > A lot of tests depend on this class, so we should approach it carefully, but > I think it's worth examining whether this class can be made a bit more > resilient to changes and easier to maintain, or at least documented better. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review
[ https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163766#comment-17163766 ] Szilard Nemeth commented on YARN-10278: --- Hi [~epayne], Can you please look at the above comment? Thanks. > CapacityScheduler test framework > ProportionalCapacityPreemptionPolicyMockFramework need some review > --- > > Key: YARN-10278 > URL: https://issues.apache.org/jira/browse/YARN-10278 > Project: Hadoop YARN > Issue Type: Task >Reporter: Gergely Pollak >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10278.001.patch, YARN-10278.002.patch, > YARN-10278.branch-3.1.001.patch, YARN-10278.branch-3.1.002.patch, > YARN-10278.branch-3.1.003.patch, YARN-10278.branch-3.2.001.patch, > YARN-10278.branch-3.2.002.patch, YARN-10278.branch-3.3.001.patch > > > This test framework class mocks a bit too heavily and simulates CS internal > behaviour with the mock methods beyond the point where it is reasonably > maintainable; any internal change in CS becomes a major headache. > A lot of tests depend on this class, so we should approach it carefully, but > I think it's worth examining whether this class can be made a bit more > resilient to changes and easier to maintain, or at least documented better. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163705#comment-17163705 ] Jim Brennan commented on YARN-4771: --- I think this is ready for review. cc: [~epayne], [~ebadger] > Some containers can be skipped during log aggregation after NM restart > -- > > Key: YARN-4771 > URL: https://issues.apache.org/jira/browse/YARN-4771 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.10.0, 3.2.1, 3.1.3 >Reporter: Jason Darrell Lowe >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-4771.001.patch, YARN-4771.002.patch, > YARN-4771.003.patch > > > A container can be skipped during log aggregation after a work-preserving > nodemanager restart if the following events occur: > # Container completes more than > yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the > restart > # At least one other container completes after the above container and before > the restart -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query
[ https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163600#comment-17163600 ] Adam Antal commented on YARN-10304: --- +1 from me. Thanks for working on this [~gandras]. > Create an endpoint for remote application log directory path query > -- > > Key: YARN-10304 > URL: https://issues.apache.org/jira/browse/YARN-10304 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10304.001.patch, YARN-10304.002.patch, > YARN-10304.003.patch, YARN-10304.004.patch, YARN-10304.005.patch, > YARN-10304.006.patch, YARN-10304.007.patch > > > The logic of the aggregated log directory path determination (currently based > on configuration) is scattered around the codebase and duplicated multiple > times. By providing a separate class for creating the path for a specific > user, it allows for an abstraction over this logic. This could be used in > place of the previously duplicated logic, moreover, we could provide an > endpoint to query this path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same
[ https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163460#comment-17163460 ] Hudson commented on YARN-10315: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18467 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18467/]) YARN-10315. Avoid sending RMNodeResourceupdate event if resource is (bibinchundatt: rev bfcd775381f1e0b94b17ce3cfca7eade95df1ea8) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java > Avoid sending RMNodeResourceupdate event if resource is same > > > Key: YARN-10315 > URL: https://issues.apache.org/jira/browse/YARN-10315 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Sushil Ks >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10315.001.patch, YARN-10315.002.patch > > > When the node is in the DECOMMISSIONING state, the RMNodeResourceUpdateEvent is > sent for every heartbeat, which results in a scheduler resource update. > Avoid sending it when the resource is unchanged. > The scheduler node resource update iterates through all the queues, which is costly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
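As a rough illustration of the idea (not the committed change), the update path can compare the node's currently registered total resource with the value reported in the heartbeat and only emit the event when they differ. The helper name below is made up.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;

// Hypothetical sketch of the guard, not the committed code.
public final class ResourceUpdateGuard {
  static void maybeFireUpdate(RMNode node, Resource reportedTotal,
      Runnable fireResourceUpdateEvent) {
    // Only pay for the scheduler-wide queue re-computation triggered by
    // RMNodeResourceUpdateEvent when the total resource actually changed.
    if (!node.getTotalCapability().equals(reportedTotal)) {
      fireResourceUpdateEvent.run();
    }
  }
}
{code}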
[jira] [Commented] (YARN-10332) RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163446#comment-17163446 ] Adam Antal commented on YARN-10332: --- I'm sorry, I missed this too. Nice catch [~yehuanhuan]. +1 for this change. > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state > > > Key: YARN-10332 > URL: https://issues.apache.org/jira/browse/YARN-10332 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: yehuanhuan >Priority: Minor > Attachments: YARN-10332.001.patch > > > RESOURCE_UPDATE event was repeatedly registered in DECOMMISSIONING state. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5302) Yarn Application log Aggregation fails due to NM can not get correct HDFS delegation token II
[ https://issues.apache.org/jira/browse/YARN-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-5302: - Summary: Yarn Application log Aggregation fails due to NM can not get correct HDFS delegation token II (was: Yarn Application log Aggreagation fails due to NM can not get correct HDFS delegation token II) > Yarn Application log Aggregation fails due to NM can not get correct HDFS > delegation token II > - > > Key: YARN-5302 > URL: https://issues.apache.org/jira/browse/YARN-5302 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn >Reporter: Xianyin Xin >Assignee: Xianyin Xin >Priority: Major > Labels: oct16-medium > Attachments: YARN-5032.001.patch, YARN-5032.002.patch, > YARN-5302.003.patch, YARN-5302.004.patch > > > Different from YARN-5098, this happens on the NM side. When the NM recovers, > credentials are read from the NMStateStore. When initializing app aggregators, > an exception occurs because of the expired tokens. The app is a long-running > service. > {code:title=LogAggregationService.java} > protected void initAppAggregator(final ApplicationId appId, String user, > Credentials credentials, ContainerLogsRetentionPolicy > logRetentionPolicy, > Map appAcls, > LogAggregationContext logAggregationContext) { > // Get user's FileSystem credentials > final UserGroupInformation userUgi = > UserGroupInformation.createRemoteUser(user); > if (credentials != null) { > userUgi.addCredentials(credentials); > } >... > try { > // Create the app dir > createAppDir(user, appId, userUgi); > } catch (Exception e) { > appLogAggregator.disableLogAggregation(); > if (!(e instanceof YarnRuntimeException)) { > appDirException = new YarnRuntimeException(e); > } else { > appDirException = (YarnRuntimeException)e; > } > appLogAggregators.remove(appId); > closeFileSystems(userUgi); > throw appDirException; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same
[ https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163423#comment-17163423 ] Adam Antal commented on YARN-10315: --- +1 from me on v2. Thanks for the patch [~Sushil-K-S]. > Avoid sending RMNodeResourceupdate event if resource is same > > > Key: YARN-10315 > URL: https://issues.apache.org/jira/browse/YARN-10315 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Sushil Ks >Priority: Major > Attachments: YARN-10315.001.patch, YARN-10315.002.patch > > > When the node is in the DECOMMISSIONING state, the RMNodeResourceUpdateEvent is > sent for every heartbeat, which results in a scheduler resource update. > Avoid sending it when the resource is unchanged. > The scheduler node resource update iterates through all the queues, which is costly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163418#comment-17163418 ] Adam Antal commented on YARN-10319: --- Indeed, the test failure is not related. +1 from me, thanks for the effort [~prabhujoseph]. > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163411#comment-17163411 ] Prabhu Joseph commented on YARN-10319: -- [~adam.antal] Failed testcase is not related, let me know if the latest patch [^YARN-10319-006.patch] is fine. Thanks. > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5098) Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token
[ https://issues.apache.org/jira/browse/YARN-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-5098: - Summary: Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token (was: Yarn Application log Aggreagation fails due to NM can not get correct HDFS delegation token) > Yarn Application Log Aggregation fails due to NM can not get correct HDFS > delegation token > -- > > Key: YARN-5098 > URL: https://issues.apache.org/jira/browse/YARN-5098 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Yesha Vora >Assignee: Jian He >Priority: Major > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: YARN-5098.1.patch, YARN-5098.1.patch, YARN-5098.2.patch, > YARN-5098.3.patch > > > Environment : HA cluster > Yarn application logs for long running application could not be gathered > because Nodemanager failed to talk to HDFS with below error. > {code} > 2016-05-16 18:18:28,533 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:finishLogAggregation(555)) - Application just > finished : application_1463170334122_0002 > 2016-05-16 18:18:28,545 WARN ipc.Client (Client.java:run(705)) - Exception > encountered while connecting to the server : > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (HDFS_DELEGATION_TOKEN token 171 for hrt_qa) can't be found in cache > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:375) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:583) > at > org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:398) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:752) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:748) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1719) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:747) > at > org.apache.hadoop.ipc.Client$Connection.access$3100(Client.java:398) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1597) > at org.apache.hadoop.ipc.Client.call(Client.java:1439) > at org.apache.hadoop.ipc.Client.call(Client.java:1386) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:240) > at com.sun.proxy.$Proxy83.getServerDefaults(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:282) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy84.getServerDefaults(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:1018) > at org.apache.hadoop.fs.Hdfs.getServerDefaults(Hdfs.java:156) > at > org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:550) > at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:687) > {code} -- This message was sent by Atlassian 
Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
Prabhu Joseph created YARN-10364: Summary: Absolute Resource [memory=0] is considered as Percentage config type Key: YARN-10364 URL: https://issues.apache.org/jira/browse/YARN-10364 Project: Hadoop YARN Issue Type: Bug Reporter: Prabhu Joseph Assignee: Prabhu Joseph Absolute Resource [memory=0] is considered as Percentage config type. This causes failure while converting queues from Percentage to Absolute Resources automatically. *Repro:* 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% 2. While converting above to absolute resource automatically, capacity of queue A = [memory=], A.B = [memory=0] This fails with below as A is considered as Absolute Resource whereas B is considered as Percentage config type. {code} 2020-07-23 09:36:40,499 WARN org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: CapacityScheduler configuration validation failed:java.io.IOException: Failed to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should use either percentage based capacityconfiguration or absolute resource together for label: {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5305) Yarn Application log Aggregation fails due to NM can not get correct HDFS delegation token III
[ https://issues.apache.org/jira/browse/YARN-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-5305: - Summary: Yarn Application log Aggregation fails due to NM can not get correct HDFS delegation token III (was: Yarn Application log Aggreagation fails due to NM can not get correct HDFS delegation token III) > Yarn Application log Aggregation fails due to NM can not get correct HDFS > delegation token III > -- > > Key: YARN-5305 > URL: https://issues.apache.org/jira/browse/YARN-5305 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Xianyin Xin >Priority: Major > > Different from YARN-5098 and YARN-5302, this problem happens when the AM submits > a startContainer request with a new HDFS token (say, tokenB) which is not > managed by YARN, so two tokens exist in the credentials of the user on the NM: > one is tokenB, the other is the one renewed by the RM (tokenA). If tokenB is > selected when connecting to HDFS and tokenB has expired, an exception occurs. > Supplementary: this problem happens because the AM didn't use the service name > as the token alias in the credentials, so two tokens for the same service can > co-exist in one Credentials object. The TokenSelector can only select the first matched > token; it doesn't care whether the token is valid or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5305) Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token III
[ https://issues.apache.org/jira/browse/YARN-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-5305: - Summary: Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token III (was: Yarn Application log Aggregation fails due to NM can not get correct HDFS delegation token III) > Yarn Application Log Aggregation fails due to NM can not get correct HDFS > delegation token III > -- > > Key: YARN-5305 > URL: https://issues.apache.org/jira/browse/YARN-5305 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Xianyin Xin >Priority: Major > > Different from YARN-5098 and YARN-5302, this problem happens when the AM submits > a startContainer request with a new HDFS token (say, tokenB) which is not > managed by YARN, so two tokens exist in the credentials of the user on the NM: > one is tokenB, the other is the one renewed by the RM (tokenA). If tokenB is > selected when connecting to HDFS and tokenB has expired, an exception occurs. > Supplementary: this problem happens because the AM didn't use the service name > as the token alias in the credentials, so two tokens for the same service can > co-exist in one Credentials object. The TokenSelector can only select the first matched > token; it doesn't care whether the token is valid or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
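The supplementary note is the key point: Credentials keys tokens by alias, so adding a token under an arbitrary alias lets two tokens for the same service coexist, and the selector may pick the stale one. A small sketch of the safer pattern (not taken from any particular patch; the class and method names are made up):

{code:java}
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;

// Sketch: keying the token by its service name means a newer token for the
// same service replaces the older one instead of coexisting with it.
public final class TokenAliasExample {
  static void addByServiceName(Credentials credentials, Token<?> token) {
    credentials.addToken(token.getService(), token);
  }
}
{code}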
[jira] [Updated] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
[ https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10364: - Affects Version/s: 3.4.0 > Absolute Resource [memory=0] is considered as Percentage config type > > > Key: YARN-10364 > URL: https://issues.apache.org/jira/browse/YARN-10364 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Absolute Resource [memory=0] is considered as Percentage config type. This > causes failure while converting queues from Percentage to Absolute Resources > automatically. > *Repro:* > 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% > 2. While converting above to absolute resource automatically, capacity of > queue A = [memory=], A.B = [memory=0] > This fails with below as A is considered as Absolute Resource whereas B is > considered as Percentage config type. > {code} > 2020-07-23 09:36:40,499 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Failed > to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should > use either percentage based capacityconfiguration or absolute resource > together for label: > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163335#comment-17163335 ] Prabhu Joseph commented on YARN-10352: -- [~wangda] The failing testcase is not related to this fix. Let me know if the latest patch [^YARN-10352-006.patch] is fine; I will commit it if there are no comments. Thanks. > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch > > > When Node Recovery is enabled, stopping a NM won't unregister it from the RM. So the RM > Active Nodes list will still contain those stopped nodes until the NM Liveliness > Monitor expires them after the configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 minutes, > Multi Node Placement assigns containers on those nodes. It needs to > exclude nodes which have not heartbeated within the configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to the > Asynchronous Capacity Scheduler threads > (CapacityScheduler#shouldSkipNodeSchedule). > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running, say worker0 > 3. Stop worker0 and start any other NM, say worker1 > 4. Submit a sleep job. The containers will time out as they are assigned to the stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
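The description points at CapacityScheduler#shouldSkipNodeSchedule as the model; a minimal sketch of that kind of staleness check is below (the class name, parameters, and threshold are illustrative, not the actual patch):

{code:java}
// Hypothetical sketch of a heartbeat-staleness check in the spirit of
// CapacityScheduler#shouldSkipNodeSchedule.
public final class HeartbeatStalenessCheck {
  static boolean shouldSkipNode(long lastHeartbeatMillis, long nowMillis,
      long heartbeatIntervalMillis) {
    // Skip nodes whose last heartbeat is older than a small multiple of the
    // configured yarn.resourcemanager.nodemanagers.heartbeat-interval-ms.
    return nowMillis - lastHeartbeatMillis > 2 * heartbeatIntervalMillis;
  }
}
{code}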
[jira] [Commented] (YARN-10315) Avoid sending RMNodeResourceupdate event if resource is same
[ https://issues.apache.org/jira/browse/YARN-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163246#comment-17163246 ] Bibin Chundatt commented on YARN-10315: --- +1, looks good to me. [~adam.antal], will wait a few days before committing. > Avoid sending RMNodeResourceupdate event if resource is same > > > Key: YARN-10315 > URL: https://issues.apache.org/jira/browse/YARN-10315 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Sushil Ks >Priority: Major > Attachments: YARN-10315.001.patch, YARN-10315.002.patch > > > When the node is in the DECOMMISSIONING state, the RMNodeResourceUpdateEvent is > sent for every heartbeat, which results in a scheduler resource update. > Avoid sending it when the resource is unchanged. > The scheduler node resource update iterates through all the queues, which is costly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (YARN-3001) RM dies because of divide by zero
[ https://issues.apache.org/jira/browse/YARN-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xin qian updated YARN-3001: --- Comment: was deleted (was: Hi, Zhihai Xu , do you know how to reproduce this problem?) > RM dies because of divide by zero > - > > Key: YARN-3001 > URL: https://issues.apache.org/jira/browse/YARN-3001 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.5.1 >Reporter: hoelog >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-3001.barnch-2.7.patch > > > RM dies because of divide by zero exception. > {code} > 2014-12-31 21:27:05,022 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type NODE_UPDATE to the scheduler > java.lang.ArithmeticException: / by zero > at > org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator.computeAvailableContainers(DefaultResourceCalculator.java:37) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1332) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1218) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1177) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:877) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:570) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:851) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:900) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599) > at java.lang.Thread.run(Thread.java:745) > 2014-12-31 21:27:05,023 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
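From the stack trace, the division happens in DefaultResourceCalculator#computeAvailableContainers when the required resource is zero. A guarded version might look like the sketch below (illustrative only, not the committed fix):

{code:java}
// Hypothetical guard for the division in computeAvailableContainers.
public final class AvailableContainersExample {
  static int computeAvailableContainers(long availableMemory, long requiredMemory) {
    if (requiredMemory <= 0) {
      // A zero-sized request would otherwise divide by zero; treat it as
      // fitting an effectively unlimited number of times.
      return Integer.MAX_VALUE;
    }
    return (int) (availableMemory / requiredMemory);
  }
}
{code}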