[jira] [Updated] (YARN-9551) TestTimelineClientV2Impl.testSyncCall fails intermittently

2022-02-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-9551:
---
Fix Version/s: 3.3.2

> TestTimelineClientV2Impl.testSyncCall fails intermittently
> --
>
> Key: YARN-9551
> URL: https://issues.apache.org/jira/browse/YARN-9551
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2, test
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Andras Gyori
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.2, 3.1.5
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> TestTimelineClientV2Impl.testSyncCall fails intermittently.
> {code:java}
> Failed
> org.apache.hadoop.yarn.client.api.impl.TestTimelineClientV2Impl.testSyncCall
> Failing for the past 1 build (Since #24083 )
> Took 1.5 sec.
> Error Message
> TimelineEntities not published as desired expected:<3> but was:<4>
> Stacktrace
> java.lang.AssertionError: TimelineEntities not published as desired 
> expected:<3> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestTimelineClientV2Impl.testSyncCall(TestTimelineClientV2Impl.java:251)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Standard Output
> 2019-05-13 15:33:46,596 WARN  [main] util.NativeCodeLoader 
> (NativeCodeLoader.java:(60)) - Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 2019-05-13 15:33:47,763 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 0 : 1,
> 2019-05-13 15:33:47,764 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 1 : 2,
> 2019-05-13 15:33:47,764 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 2 : 3,
> 2019-05-13 15:33:47,764 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 3 : 4,
> 2019-05-13 15:33:47,765 INFO  [main] imp

[jira] [Updated] (YARN-10991) Fix to ignore the grouping "[]" for resourcesStr in parseResourcesString method

2022-02-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10991:

Fix Version/s: 3.3.2
   (was: 3.3.3)

> Fix to ignore the grouping "[]" for resourcesStr in parseResourcesString 
> method
> ---
>
> Key: YARN-10991
> URL: https://issues.apache.org/jira/browse/YARN-10991
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Affects Versions: 3.3.1
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2, 3.2.4
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Currently, if the resourcesStr argument passed to the parseResourcesString 
> method ends with a closing "]", the bracket is not ignored.






[jira] [Updated] (YARN-11007) Correct words in YARN documents

2022-02-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-11007:

Fix Version/s: 3.3.2
   (was: 3.3.3)

> Correct words in YARN documents
> ---
>
> Key: YARN-11007
> URL: https://issues.apache.org/jira/browse/YARN-11007
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 3.3.1
>Reporter: guophilipse
>Assignee: guophilipse
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2, 3.2.4
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>







[jira] [Updated] (YARN-11057) NodeManager may generate too many empty log dirs when we configure many log dirs

2022-01-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-11057:

Target Version/s: 3.3.3  (was: 3.3.2)

> NodeManager may generate too many empty log dirs when we configure many log 
> dirs
> 
>
> Key: YARN-11057
> URL: https://issues.apache.org/jira/browse/YARN-11057
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, nodemanager
>Affects Versions: 2.7.7, 3.3.1
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Major
> Attachments: YARN-11507.0001.patch
>
>
>   NodeManager may generate too many empty log dirs when we configure many log 
> dirs in NonAggregatingLogHandler mode. For example: we have 24 disks and 512G 
> of memory, and assume the average container runs for 1 minute and uses 4G. 
> Then one server runs 512G / 4G = 128 containers in parallel, and under the 
> current policy every container creates more than 24 directories. The total 
> number of directories created in one week is 128 * 24 * (60 * 24 * 7) = 
> 30,965,760, and that does not count the tmp directories. This consumes too 
> many inodes on the server and hurts the disk IO utilization, because the 
> inodes are cached in Linux memory; when memory is insufficient the cached 
> inodes are evicted, which increases disk scans and drives the IO utilization 
> up. In fact, of these directories only one per container is used for the 
> container's logs; the others stay empty. So we can delete the empty 
> directories when the job finishes (see the sketch below), which saves a large 
> number of inodes.
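> A rough sketch of that cleanup (the configuration key is real, but the hook 
> point, variables and layout of the per-container directory are assumptions, 
> not the attached patch):
> {code:java}
> // On application finish, walk the configured NM log dirs and remove the
> // per-container directories that were created but never written to.
> for (String logDir : conf.getStrings(YarnConfiguration.NM_LOG_DIRS)) {
>   File containerLogDir = new File(new File(logDir, appIdStr), containerIdStr);
>   String[] children = containerLogDir.list();
>   if (children != null && children.length == 0) {
>     // delete() only succeeds when the directory is really empty.
>     containerLogDir.delete();
>   }
> }
> {code}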






[jira] [Updated] (YARN-9063) ATS 1.5 fails to start if RollingLevelDb files are corrupt or missing

2022-01-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-9063:
---
Fix Version/s: 3.3.2
   (was: 3.3.3)

> ATS 1.5 fails to start if RollingLevelDb files are corrupt or missing
> -
>
> Key: YARN-9063
> URL: https://issues.apache.org/jira/browse/YARN-9063
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver, timelineservice
>Affects Versions: 2.8.0
>Reporter: Tarun Parimi
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> ATS v1.5 fails to start up if there are missing files in 
> RollingLevelDBTimelineStore. YARN-6054 fixes this issue only for 
> LeveldbTimelineStore. Since RollingLevelDBTimelineStore opens multiple LevelDB 
> databases and rolls them, it needs a separate fix. The error is shown below:
> {code}
> 18/11/13 07:00:56 FATAL applicationhistoryservice.ApplicationHistoryServer: 
> Error starting ApplicationHistoryServer 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/ats_folder/yarn/timeline/leveldb-timeline-store/owner-ldb/05.sst 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) 
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>  
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:111)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:174)
>  
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:184)
>  
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/ats-folder/yarn/timeline/leveldb-timeline-store/owner-ldb/05.sst 
> at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) 
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) 
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) 
> at 
> org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> {code}
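> One possible shape for the fix, mirroring the YARN-6054 style of handling in 
> the non-rolling store: catch the corruption on open and attempt a repair 
> before reopening (method name and logging are assumptions, not the committed 
> patch):
> {code:java}
> private DB openWithRecovery(JniDBFactory factory, File dbPath, Options options)
>     throws IOException {
>   try {
>     return factory.open(dbPath, options);
>   } catch (NativeDB.DBException e) {
>     // Corrupt or missing SST files: repair the store and retry the open so
>     // the timeline server can still come up instead of aborting startup.
>     LOG.warn("Error opening leveldb at " + dbPath + ", attempting repair", e);
>     factory.repair(dbPath, options);
>     return factory.open(dbPath, options);
>   }
> }
> {code}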






[jira] [Updated] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe

2022-01-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10820:

Fix Version/s: 3.3.2
   (was: 3.3.3)

> Make GetClusterNodesRequestPBImpl thread safe
> -
>
> Key: YARN-10820
> URL: https://issues.apache.org/jira/browse/YARN-10820
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: client
>Affects Versions: 3.1.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2, 3.2.4
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> yarn node list intermittently fails with the error below:
> {code:java}
> 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on 
> [resourcemanager-1], so propagating back to caller.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>  at java.util.ArrayList.add(ArrayList.java:465)
>  at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.UnsupportedOperationException on 
> [resourcemanager-0], so propagating back to caller.
> Exception in thread "main" java.lang.UnsupportedOperationException
> at 
> java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(Nativ

[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2022-01-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-11020:

Fix Version/s: 3.3.2
   (was: 3.3.3)

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2, under the Logs tab of an application, the "No container data 
> available" message is shown if the application attempt only has a single 
> container (the AM container).
> The culprit is that the response from YARN is not consistent: for a single 
> container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We cannot change the response of the endpoint due to backward compatibility, 
> therefore we need to make UI2 able to handle both shapes of the response.
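> The fix itself belongs in the UI2 (Ember) model/serializer layer, but as an 
> illustration of the normalization needed, here is a hedged Jackson-based 
> sketch (variable names are assumptions):
> {code:java}
> // containerLogsInfo is a single object for one container and an array for
> // several; always hand the rest of the code a list of container log infos.
> ObjectMapper mapper = new ObjectMapper();
> JsonNode logsInfo = mapper.readTree(responseBody).get("containerLogsInfo");
> List<JsonNode> containers = new ArrayList<>();
> if (logsInfo != null) {
>   if (logsInfo.isArray()) {
>     logsInfo.forEach(containers::add);
>   } else {
>     containers.add(logsInfo);
>   }
> }
> {code}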






[jira] [Updated] (YARN-6862) Nodemanager resource usage metrics sometimes are negative

2022-01-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-6862:
---
Fix Version/s: 3.3.2
   (was: 3.3.3)

> Nodemanager resource usage metrics sometimes are negative
> -
>
> Key: YARN-6862
> URL: https://issues.apache.org/jira/browse/YARN-6862
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.2
>Reporter: YunFan Zhou
>Assignee: Benjamin Teke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2, 3.2.4
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> When we collect real-time resource usage metrics from the NM, we found that 
> the values are sometimes invalid.
> For example, the following values were collected at one point:
> "milliVcoresUsed":-5808,
> "currentPmemUsage":-1,
> "currentVmemUsage":-1,
> "cpuUsagePercentPerCore":-968.1026
> "cpuUsageTotalCoresPercentage":-24.202564,
> "pmemLimit":2147483648,
> "vmemLimit":4509715456
> There are many negative values, so there may be a bug in the NM.
> We should fix it, because the NM's real-time metrics are quite important to 
> us.






[jira] [Updated] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2022-01-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10178:

Fix Version/s: 3.3.2
   (was: 3.3.3)

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Andras Gyori
>Priority: Major
> Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4
>
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires the comparator to satisfy the usual contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y && y > z  implies  x > z
> 3. x.compareTo(y) == 0  implies  sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the array elements do not satisfy these requirements, TimSort throws 
> 'java.lang.IllegalArgumentException'.
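> A hedged illustration of the failure mode and the usual remedy: the 
> comparator reads live queue usage that other threads keep mutating, so the 
> same pair of queues can compare differently within one sort; taking an 
> immutable snapshot before sorting keeps every comparison self-consistent 
> (class and field names below are illustrative, not the actual patch):
> {code:java}
> import java.util.ArrayList;
> import java.util.Comparator;
> import java.util.List;
> 
> class QueueSortingSketch {
>   // Stand-in for a CSQueue whose usage other threads keep updating.
>   static class LiveQueue {
>     volatile float usedCapacity;
>   }
> 
>   // Copy the value once per queue, then sort the copies: every comparison
>   // during the sort sees the same numbers, so TimSort's contract holds.
>   static class Snapshot {
>     final LiveQueue queue;
>     final float usedCapacity;
>     Snapshot(LiveQueue q) {
>       this.queue = q;
>       this.usedCapacity = q.usedCapacity;
>     }
>   }
> 
>   static List<Snapshot> sortByUsage(List<LiveQueue> queues) {
>     List<Snapshot> snapshots = new ArrayList<>();
>     for (LiveQueue q : queues) {
>       snapshots.add(new Snapshot(q));
>     }
>     snapshots.sort(Comparator.comparingDouble(s -> s.usedCapacity));
>     return snapshots;
>   }
> }
> {code}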
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, the 
> Capacity Scheduler compares queues using these resource-usage values:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler's global scheduling, the async thread uses 
> PriorityUtilizationQueueOrderingPolicy to choose the queue a container is 
> assigned to, constructs a CSAssignment, and adds it to the backlog via 
> submitResourceCommitRequest. ResourceCommitterService will then tryCommit 
> this CSAssignment; looking at the tryCommit function, it updates the queue 
> resource usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>   if (a

[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2022-01-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-8234:
---
Fix Version/s: 3.3.2
   (was: 3.3.3)

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Ashutosh Gupta
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4
>
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, 
> YARN-8234.003.patch, YARN-8234.004.patch
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> When the system metrics publisher is enabled, the RM pushes events to the 
> timeline server via its REST API. If the cluster load is heavy, many events 
> are sent and the timeline server's event handler thread gets locked. 
> YARN-7266 discusses the details of this problem. Because of the lock, the 
> timeline server cannot receive events as fast as the RM generates them, and 
> lots of timeline events pile up in the RM's memory. Eventually those events 
> consume all of the RM's memory and the RM runs a full GC (which causes a JVM 
> stop-the-world pause and a timeout from the RM to ZooKeeper) or even hits an 
> OOM.
> The main problem is that the timeline server cannot consume events as fast as 
> they are generated. Today the RM system metrics publisher puts only one event 
> in each request, so most of the time is spent on HTTP headers and connection 
> handling on the timeline side; only a small fraction is spent on the timeline 
> event itself, which is the truly valuable part.
> In this issue, we add a buffer to the system metrics publisher and let it 
> send events to the timeline server in batches, one batch per request. With a 
> batch size of 1000, in our experiments the rate at which the timeline server 
> receives events improved by 100x. We have run this in our production 
> environment, which accepts 2 app's in one hour, and it works fine.
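> A minimal sketch of the batching idea (class shape, names and the flush 
> trigger are illustrative assumptions; the real patch wires this into the RM 
> system metrics publisher):
> {code:java}
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
> import org.apache.hadoop.yarn.client.api.TimelineClient;
> import org.apache.hadoop.yarn.exceptions.YarnException;
> 
> class BatchingTimelinePublisher {
>   private final TimelineClient client;
>   private final int batchSize;
>   private final List<TimelineEntity> buffer = new ArrayList<>();
> 
>   BatchingTimelinePublisher(TimelineClient client, int batchSize) {
>     this.client = client;
>     this.batchSize = batchSize;
>   }
> 
>   // Buffer events and send them as one putEntities call per batch instead
>   // of one REST request per event.
>   synchronized void publish(TimelineEntity entity)
>       throws IOException, YarnException {
>     buffer.add(entity);
>     if (buffer.size() >= batchSize) {
>       flush();
>     }
>   }
> 
>   synchronized void flush() throws IOException, YarnException {
>     if (!buffer.isEmpty()) {
>       client.putEntities(buffer.toArray(new TimelineEntity[0]));
>       buffer.clear();
>     }
>   }
> }
> {code}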






[jira] [Updated] (YARN-10444) use openFile() with sequential IO for localizing files.

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10444:

Target Version/s: 3.3.3  (was: 3.3.2)

> use openFile() with sequential IO for localizing files.
> ---
>
> Key: YARN-10444
> URL: https://issues.apache.org/jira/browse/YARN-10444
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>
> HADOOP-16202 adds standard options for declaring the read/seek policy when 
> reading a file. These should be set to sequential IO when localising 
> resources, so that even if the default/cluster settings for a file system are 
> optimized for random IO, artifact downloads are still read at the maximum 
> speed possible (one big GET to the EOF).
> Most of this happens in hadoop-common, but some tuning of FSDownload can 
> assist:
> * tar/jar downloads must also be sequential
> * if the FileStatus is passed around, it can be used in the open request to 
>   skip checks when loading the file.
> Together this can save 3 HEAD requests per resource, with the sequential IO 
> avoiding any splitting of the big read into separate block GETs.
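> A hedged sketch of what the localizer's open could look like with the builder 
> API (fs, path, status, conf and localOut are assumed to be in scope; the 
> read-policy key follows the HADOOP-16202 naming and should be treated as an 
> assumption rather than the final constant):
> {code:java}
> // One long sequential read, reusing the FileStatus we already have so the
> // client can skip extra HEAD/getFileStatus calls before the download.
> FSDataInputStream in = FutureIO.awaitFuture(
>     fs.openFile(path)
>         .opt("fs.option.openfile.read.policy", "sequential")
>         .withFileStatus(status)
>         .build());
> // copyBytes with close=true closes both streams when the copy finishes.
> IOUtils.copyBytes(in, localOut, conf, true);
> {code}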






[jira] [Updated] (YARN-10510) TestAppPage.testAppBlockRenderWithNullCurrentAppAttempt will cause NullPointerException

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10510:

Target Version/s: 3.3.3  (was: 3.3.2)

> TestAppPage.testAppBlockRenderWithNullCurrentAppAttempt  will cause 
> NullPointerException
> 
>
> Key: YARN-10510
> URL: https://issues.apache.org/jira/browse/YARN-10510
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: test
>Reporter: tuyu
>Priority: Minor
>  Labels: test
> Attachments: YARN-10510.001.patch
>
>
> Running TestAppPage.testAppBlockRenderWithNullCurrentAppAttempt causes the 
> exception below:
> {code:java}
> 2020-12-01 20:16:41,412 ERROR [main] webapp.AppBlock 
> (AppBlock.java:render(124)) - Failed to read the application 
> application_1234_.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppBlock.getApplicationReport(RMAppBlock.java:218)
>   at 
> org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:112)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppBlock.render(RMAppBlock.java:71)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestAppPage.testAppBlockRenderWithNullCurrentAppAttempt(TestAppPage.java:92)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
>   at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
>   at 
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
>   at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
> Disconnected from the target VM, address: '127.0.0.1:60623', transport: 
> 'socket'
> {code}
> This is because mockClientRMService does not mock the getApplicationReport 
> and getApplicationAttempts interfaces; see the sketch below.
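> A hedged sketch of the missing stubbing (the stubbed return values are 
> illustrative):
> {code:java}
> // Stub the calls AppBlock#render makes so the mocked ClientRMService no
> // longer returns null and triggers the NullPointerException.
> ApplicationReport report = mock(ApplicationReport.class);
> when(mockClientRMService.getApplicationReport(
>     any(GetApplicationReportRequest.class)))
>     .thenReturn(GetApplicationReportResponse.newInstance(report));
> when(mockClientRMService.getApplicationAttempts(
>     any(GetApplicationAttemptsRequest.class)))
>     .thenReturn(GetApplicationAttemptsResponse.newInstance(
>         Collections.<ApplicationAttemptReport>emptyList()));
> {code}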






[jira] [Updated] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10124:

Target Version/s: 3.3.3  (was: 3.3.2)

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch, YARN-10124-002.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities are > 0. To 
> disable a parent queue temporarily, the user can only STOP the queue, but the 
> capacity of the queue then cannot be used by other queues. Allowing 0 
> capacity for a parent queue lets the user reuse that capacity for other 
> queues while retaining the child queue capacity values (otherwise the user 
> has to set all child queue capacities to 0).






[jira] [Updated] (YARN-10138) Document the new JHS API

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10138:

Target Version/s: 3.4.0, 3.3.3  (was: 3.4.0, 3.3.2)

> Document the new JHS API
> 
>
> Key: YARN-10138
> URL: https://issues.apache.org/jira/browse/YARN-10138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
>
> A new API has been introduced in YARN-10028, but we did not document it in 
> the JHS API documentation. Let's add it.






[jira] [Updated] (YARN-10324) Fetch data from NodeManager may case read timeout when disk is busy

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10324:

Target Version/s: 3.3.3, 2.7.8  (was: 2.7.8, 3.3.2)

> Fetch data from NodeManager may case read timeout when disk is busy
> ---
>
> Key: YARN-10324
> URL: https://issues.apache.org/jira/browse/YARN-10324
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: auxservices
>Affects Versions: 2.7.0, 3.2.1
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Minor
>  Labels: patch
> Attachments: YARN-10324.001.patch, YARN-10324.002.patch, 
> YARN-10324.003.patch, image-2021-05-21-17-48-03-476.png
>
>
>  As the cluster size grows bigger and bigger, the time the Reduce tasks spend 
> fetching the Map results from the NodeManager gets longer and longer. We 
> often see WARN logs like the following in the reducer's logs:
> {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> TX-196-168-211.com:13562 with 5 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
> {quote}
>  Checking the NodeManager server, we found that the disk IO utilization and 
> the number of connections became very high when the read timeouts happened. 
> Our analysis: with 20,000 maps and 1,000 reduces, the NodeManager performs 20 
> million IO stream operations in the shuffle phase. When the amount of data 
> each reduce fetches from the map output files is very small, the disk IO 
> utilization becomes very high on a big cluster, read timeouts happen 
> frequently, and the application finish time gets longer.
> ShuffleHandler already has an IndexCache for the file.out.index files, so we 
> want to turn the many small IOs into fewer big IOs to reduce the number of 
> small disk reads. We therefore cache all the small file data (file.out) in 
> memory when the first fetch request arrives; subsequent fetch requests then 
> only read from memory and avoid disk IO. After caching the data in memory, 
> the read timeouts disappeared.
>  
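> A very rough sketch of that caching idea (the cache shape is an assumption 
> and leaves out the size bound/eviction a real patch would need):
> {code:java}
> import java.io.IOException;
> import java.io.UncheckedIOException;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.concurrent.ConcurrentHashMap;
> 
> class SmallShuffleFileCache {
>   // file.out path -> full file contents, populated on the first fetch.
>   private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();
> 
>   byte[] get(String mapOutputFile) {
>     return cache.computeIfAbsent(mapOutputFile, key -> {
>       try {
>         // One big sequential read instead of many tiny per-reducer reads.
>         return Files.readAllBytes(Paths.get(key));
>       } catch (IOException e) {
>         throw new UncheckedIOException(e);
>       }
>     });
>   }
> }
> {code}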






[jira] [Updated] (YARN-6862) Nodemanager resource usage metrics sometimes are negative

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-6862:
---
Target Version/s: 3.3.3  (was: 3.3.2)

> Nodemanager resource usage metrics sometimes are negative
> -
>
> Key: YARN-6862
> URL: https://issues.apache.org/jira/browse/YARN-6862
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.2
>Reporter: YunFan Zhou
>Assignee: Benjamin Teke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> When we collect real-time resource usage metrics from the NM, we found that 
> the values are sometimes invalid.
> For example, the following values were collected at one point:
> "milliVcoresUsed":-5808,
> "currentPmemUsage":-1,
> "currentVmemUsage":-1,
> "cpuUsagePercentPerCore":-968.1026
> "cpuUsageTotalCoresPercentage":-24.202564,
> "pmemLimit":2147483648,
> "vmemLimit":4509715456
> There are many negative values, so there may be a bug in the NM.
> We should fix it, because the NM's real-time metrics are quite important to 
> us.






[jira] [Updated] (YARN-10660) YARN Web UI have problem when show node partitions resource

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10660:

Target Version/s: 3.2.4, 3.3.3  (was: 3.3.2, 3.2.4)

> YARN Web UI have problem when show node partitions resource
> ---
>
> Key: YARN-10660
> URL: https://issues.apache.org/jira/browse/YARN-10660
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 3.1.0, 3.1.1, 3.2.1, 3.2.2
>Reporter: tuyu
>Priority: Minor
> Attachments: 2021-03-01 19-56-02 的屏幕截图.png, YARN-10660.patch
>
>
> When the YARN label feature is enabled, the YARN UI shows queue resources 
> per partition, but there is a problem when clicking the expand button: the 
> URL keeps growing very long, like this
> {code:java}
> 127.0.0.1:20701/cluster/scheduler?openQueues=Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20
> {code}
> The root cause is:
> {code:java}
> origin url is:
>   Partition: <DEFAULT_PARTITION> ...
> htmlencode is:
>   Partition: <DEFAULT_PARTITION> ...
> SchedulerPageUtil has some javascript code in storeExpandedQueue:
>   tmpCurrentParam = tmpCurrentParam.split('&');
> the "Partition: <DEFAULT_PARTITION> ..." value gets split and len > 1; the 
> problem logic is here: when the expand button is clicked closed, the function 
> clears the params, but the split array does not match the origin url
> {code}
> When the expand button is clicked closed, "<DEFAULT_PARTITION> ... vCores:96>" 
> gets appended again; if expand is clicked multiple times, the URL length 
> grows too long.
>   






[jira] [Updated] (YARN-10977) AppMaster register UAM can lost application priority in subCluster

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10977:

Target Version/s: 3.3.3  (was: 3.3.2)

> AppMaster register UAM can lost application priority in subCluster
> --
>
> Key: YARN-10977
> URL: https://issues.apache.org/jira/browse/YARN-10977
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: amrmproxy, nodemanager, resourcemanager
>Affects Versions: 2.9.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Major
> Attachments: YARN-10977.0001.patch
>
>
> When the AppMaster registers a UAM with a subcluster in a YARN federation 
> cluster, the application priority can be lost, which makes the subcluster's 
> RM allocate resources to the application with the default priority regardless 
> of the application's real priority.
> By analyzing the code, I found that when the AppMaster submits the 
> application to the subcluster it does not set the priority on the 
> ApplicationSubmissionContext. As follows:
> {code:java}
> private void submitUnmanagedApp(ApplicationId appId)
>  throws YarnException, IOException {
>  SubmitApplicationRequest submitRequest =
>  this.recordFactory.newRecordInstance(SubmitApplicationRequest.class);
> ApplicationSubmissionContext context = this.recordFactory
>  .newRecordInstance(ApplicationSubmissionContext.class);
> context.setApplicationId(appId);
>  context.setApplicationName(APP_NAME + "-" + appNameSuffix);
>  if (StringUtils.isBlank(this.queueName)) {
>  context.setQueue(this.conf.get(DEFAULT_QUEUE_CONFIG,
>  YarnConfiguration.DEFAULT_QUEUE_NAME));
>  } else {
>  context.setQueue(this.queueName);
>  }
> ContainerLaunchContext amContainer =
>  this.recordFactory.newRecordInstance(ContainerLaunchContext.class);
>  Resource resource = BuilderUtils.newResource(1024, 1);
>  context.setResource(resource);
>  context.setAMContainerSpec(amContainer);
>  submitRequest.setApplicationSubmissionContext(context);
> context.setUnmanagedAM(true);
>  context.setKeepContainersAcrossApplicationAttempts(
>  this.keepContainersAcrossApplicationAttempts);
> LOG.info("Submitting unmanaged application {}", appId);
>  this.rmClient.submitApplication(submitRequest);
>  }
> {code}
> Finally, I fixed this by setting the application's priority on the 
> ApplicationSubmissionContext, using the priority taken from the home 
> cluster's response when the AppMaster registers with the home cluster's RM.
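> The essence of the fix, on top of the submitUnmanagedApp code above, is a 
> single call (appPriority stands for the priority obtained from the home 
> cluster's register response; the exact plumbing is an assumption):
> {code:java}
> // Carry the application's real priority into the UAM submission so the
> // subcluster's RM does not fall back to the default priority.
> if (appPriority != null) {
>   context.setPriority(appPriority);
> }
> {code}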


