[jira] [Updated] (YARN-10444) use openFile() with sequential IO for localizing files.

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10444:

Target Version/s: 3.3.3  (was: 3.3.2)

> use openFile() with sequential IO for localizing files.
> ---
>
> Key: YARN-10444
> URL: https://issues.apache.org/jira/browse/YARN-10444
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>
> HADOOP-16202 adds standard options for declaring the read/seek policy when
> reading a file. These should be set to sequential IO when localizing
> resources, so that if the default/cluster settings for a file system are
> optimized for random IO, artifact downloads are still read at the maximum
> speed possible (one big GET to the EOF).
> Most of this happens in hadoop-common, but some tuning of FSDownload can
> assist:
> * tar/jar downloads must also be sequential
> * if the FileStatus is passed around, it can be used in the open request to
>   skip checks when loading the file.
> Together this can save 3 HEAD requests per resource, with the sequential IO
> avoiding any splitting of the big read into separate block GETs.
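> For illustration, a minimal sketch (not the actual FSDownload change) of the
> builder-style open described above; the helper name is hypothetical and the
> option key is the HADOOP-16202 string form rather than the library constant:
> {code:java}
> // Sketch only: open a resource for localization with sequential IO.
> static FSDataInputStream openForLocalization(FileSystem fs, Path path)
>     throws Exception {
>   FileStatus status = fs.getFileStatus(path);   // often already known to FSDownload
>   return fs.openFile(path)
>       .opt("fs.option.openfile.read.policy", "sequential") // one long GET to EOF
>       .withFileStatus(status)                   // lets stores skip extra HEAD checks
>       .build()
>       .get();                                   // openFile() returns a future
> }
> {code}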






[jira] [Updated] (YARN-10510) TestAppPage.testAppBlockRenderWithNullCurrentAppAttempt will cause NullPointerException

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10510:

Target Version/s: 3.3.3  (was: 3.3.2)

> TestAppPage.testAppBlockRenderWithNullCurrentAppAttempt  will cause 
> NullPointerException
> 
>
> Key: YARN-10510
> URL: https://issues.apache.org/jira/browse/YARN-10510
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: test
>Reporter: tuyu
>Priority: Minor
>  Labels: test
> Attachments: YARN-10510.001.patch
>
>
> Running TestAppPage.testAppBlockRenderWithNullCurrentAppAttempt causes the
> exception below:
> {code:java}
> 2020-12-01 20:16:41,412 ERROR [main] webapp.AppBlock 
> (AppBlock.java:render(124)) - Failed to read the application 
> application_1234_.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppBlock.getApplicationReport(RMAppBlock.java:218)
>   at 
> org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:112)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppBlock.render(RMAppBlock.java:71)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestAppPage.testAppBlockRenderWithNullCurrentAppAttempt(TestAppPage.java:92)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
>   at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
>   at 
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
>   at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
> Disconnected from the target VM, address: '127.0.0.1:60623', transport: 
> 'socket'
> {code}
> This happens because mockClientRMService does not stub the
> getApplicationReport and getApplicationAttempts interfaces.
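> A hedged sketch of the kind of stubbing this needs (written against the
> test's mockClientRMService; Mockito static imports and the exact response
> types are assumptions, see YARN-10510.001.patch for the actual fix):
> {code:java}
> // Sketch only; assumes static imports of Mockito.mock/when/any and that the
> // enclosing test method declares "throws Exception".
> GetApplicationReportResponse reportResponse =
>     mock(GetApplicationReportResponse.class);
> when(reportResponse.getApplicationReport())
>     .thenReturn(mock(ApplicationReport.class));
> when(mockClientRMService.getApplicationReport(
>     any(GetApplicationReportRequest.class))).thenReturn(reportResponse);
> 
> GetApplicationAttemptsResponse attemptsResponse =
>     mock(GetApplicationAttemptsResponse.class);
> when(attemptsResponse.getApplicationAttemptList())
>     .thenReturn(Collections.emptyList());
> when(mockClientRMService.getApplicationAttempts(
>     any(GetApplicationAttemptsRequest.class))).thenReturn(attemptsResponse);
> {code}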






[jira] [Updated] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10124:

Target Version/s: 3.3.3  (was: 3.3.2)

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch, YARN-10124-002.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, a user can only STOP the queue, but its capacity 
> cannot be reused by other queues. Allowing 0 capacity for a parent queue lets 
> the user reuse that capacity for other queues while retaining the child queue 
> capacity values (otherwise the user would have to set all child queue 
> capacities to 0).
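> For illustration, a hedged, test-style sketch (queue names are hypothetical)
> of the layout this change would permit:
> {code:java}
> // Sketch only: a 0-capacity parent whose children keep non-zero capacities.
> // Today the capacity scheduler rejects this combination.
> CapacitySchedulerConfiguration conf = new CapacitySchedulerConfiguration();
> conf.setQueues("root", new String[] {"parentA", "parentB"});
> conf.setCapacity("root.parentA", 0f);          // temporarily disabled parent
> conf.setQueues("root.parentA", new String[] {"childA1", "childA2"});
> conf.setCapacity("root.parentA.childA1", 60f); // child values are retained
> conf.setCapacity("root.parentA.childA2", 40f);
> conf.setCapacity("root.parentB", 100f);        // freed capacity reused here
> {code}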






[jira] [Updated] (YARN-10138) Document the new JHS API

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10138:

Target Version/s: 3.4.0, 3.3.3  (was: 3.4.0, 3.3.2)

> Document the new JHS API
> 
>
> Key: YARN-10138
> URL: https://issues.apache.org/jira/browse/YARN-10138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
>
> A new API has been introduced in YARN-10028, but we did not document it in 
> the JHS API documentation. Let's add it.






[jira] [Updated] (YARN-10324) Fetch data from NodeManager may case read timeout when disk is busy

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10324:

Target Version/s: 3.3.3, 2.7.8  (was: 2.7.8, 3.3.2)

> Fetch data from NodeManager may case read timeout when disk is busy
> ---
>
> Key: YARN-10324
> URL: https://issues.apache.org/jira/browse/YARN-10324
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: auxservices
>Affects Versions: 2.7.0, 3.2.1
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Minor
>  Labels: patch
> Attachments: YARN-10324.001.patch, YARN-10324.002.patch, 
> YARN-10324.003.patch, image-2021-05-21-17-48-03-476.png
>
>
>  As the cluster size grows, the time a reduce spends fetching map output from 
> the NodeManager grows as well. We often see WARN logs like the following in 
> the reducer's logs:
> {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> TX-196-168-211.com:13562 with 5 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
> {quote}
>  Checking the NodeManager hosts, we found that disk IO utilization and the 
> number of connections became very high when the read timeouts happened. We 
> estimate that with 20,000 maps and 1,000 reduces, the NodeManager performs 
> about 20 million IO stream operations during the shuffle phase. When each 
> reduce fetches only a small amount of data from the map output files, disk IO 
> utilization becomes very high on a big cluster, read timeouts happen 
> frequently, and applications take longer to finish.
> ShuffleHandler already has an IndexCache for the file.out.index files. We want 
> to turn the many small IOs into fewer big IOs to reduce the number of small 
> disk reads, so we try to cache the whole (small) file.out data in memory when 
> the first fetch request arrives; subsequent fetch requests then read from 
> memory and avoid disk IO. After caching the data in memory, the read timeouts 
> disappeared.
>  
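> A hedged illustration of the caching idea (not the actual patch; a real
> implementation in ShuffleHandler would bound the cache size and evict
> entries, similar to the existing IndexCache):
> {code:java}
> import java.io.IOException;
> import java.io.UncheckedIOException;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.concurrent.ConcurrentHashMap;
> 
> // Sketch only: cache the whole (small) file.out bytes on the first fetch so
> // that later fetch requests are served from memory instead of the disk.
> class MapOutputDataCache {
>   private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();
> 
>   byte[] get(String fileOutPath) {
>     return cache.computeIfAbsent(fileOutPath, path -> {
>       try {
>         // One big sequential read instead of many small reads per reducer.
>         return Files.readAllBytes(Paths.get(path));
>       } catch (IOException e) {
>         throw new UncheckedIOException(e);
>       }
>     });
>   }
> }
> {code}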






[jira] [Updated] (YARN-6862) Nodemanager resource usage metrics sometimes are negative

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-6862:
---
Target Version/s: 3.3.3  (was: 3.3.2)

> Nodemanager resource usage metrics sometimes are negative
> -
>
> Key: YARN-6862
> URL: https://issues.apache.org/jira/browse/YARN-6862
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.2
>Reporter: YunFan Zhou
>Assignee: Benjamin Teke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> When we collect real-time resource usage metrics in the NM, we sometimes find 
> invalid values.
> For example, the following values were collected at one point:
> "milliVcoresUsed":-5808,
> "currentPmemUsage":-1,
> "currentVmemUsage":-1,
> "cpuUsagePercentPerCore":-968.1026
> "cpuUsageTotalCoresPercentage":-24.202564,
> "pmemLimit":2147483648,
> "vmemLimit":4509715456
> Many of the values are negative, which suggests a bug in the NM.
> We should fix it, because the NM's real-time metrics are quite important to 
> us.






[jira] [Updated] (YARN-10660) YARN Web UI have problem when show node partitions resource

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10660:

Target Version/s: 3.2.4, 3.3.3  (was: 3.3.2, 3.2.4)

> YARN Web UI have problem when show node partitions resource
> ---
>
> Key: YARN-10660
> URL: https://issues.apache.org/jira/browse/YARN-10660
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 3.1.0, 3.1.1, 3.2.1, 3.2.2
>Reporter: tuyu
>Priority: Minor
> Attachments: 2021-03-01 19-56-02 的屏幕截图.png, YARN-10660.patch
>
>
> When the YARN node label feature is enabled, the YARN UI shows queue 
> resources per partition, but there is a problem when clicking the expand 
> button: the URL grows very long, like this
> {code:java}
> 127.0.0.1:20701/cluster/scheduler?openQueues=Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20
> {code}
> The root cause:
> {code:java}
> origin url is:
>    Partition: <DEFAULT_PARTITION> <memory:..., vCores:...>
> htmlencode is:
>    Partition: &lt;DEFAULT_PARTITION&gt; &lt;memory:..., vCores:...&gt;
> SchedulerPageUtil has some javascript code in storeExpandedQueue:
>    tmpCurrentParam = tmpCurrentParam.split('&');
> {code}
> The html-encoded "Partition: ..." value gets split on '&' into an array of 
> length > 1, and this is where the logic goes wrong: when the expand button is 
> clicked to collapse, the function should clear the parameters, but the split 
> array no longer matches the origin url.
> So each time the expand button is clicked to collapse, the partition fragment 
> (e.g. <DEFAULT_PARTITION> <memory:..., vCores:96>) is appended again; clicking 
> expand multiple times makes the URL grow longer and longer.
>   






[jira] [Updated] (YARN-10977) AppMaster register UAM can lost application priority in subCluster

2021-11-08 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10977:

Target Version/s: 3.3.3  (was: 3.3.2)

> AppMaster register UAM can lost application priority in subCluster
> --
>
> Key: YARN-10977
> URL: https://issues.apache.org/jira/browse/YARN-10977
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: amrmproxy, nodemanager, resourcemanager
>Affects Versions: 2.9.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Major
> Attachments: YARN-10977.0001.patch
>
>
> When the AppMaster registers a UAM with a sub-cluster in a YARN federation 
> cluster, the application priority can be lost, which makes the sub-cluster's 
> RM allocate resources to the application using the default priority 
> regardless of the application's real priority.
> By analyzing the code, I found that when the AppMaster submits the 
> application to the sub-cluster it does not set the priority on the 
> ApplicationSubmissionContext. As follows:
> {code:java}
> private void submitUnmanagedApp(ApplicationId appId)
>     throws YarnException, IOException {
>   SubmitApplicationRequest submitRequest =
>       this.recordFactory.newRecordInstance(SubmitApplicationRequest.class);
> 
>   ApplicationSubmissionContext context = this.recordFactory
>       .newRecordInstance(ApplicationSubmissionContext.class);
> 
>   context.setApplicationId(appId);
>   context.setApplicationName(APP_NAME + "-" + appNameSuffix);
>   if (StringUtils.isBlank(this.queueName)) {
>     context.setQueue(this.conf.get(DEFAULT_QUEUE_CONFIG,
>         YarnConfiguration.DEFAULT_QUEUE_NAME));
>   } else {
>     context.setQueue(this.queueName);
>   }
> 
>   ContainerLaunchContext amContainer =
>       this.recordFactory.newRecordInstance(ContainerLaunchContext.class);
>   Resource resource = BuilderUtils.newResource(1024, 1);
>   context.setResource(resource);
>   context.setAMContainerSpec(amContainer);
>   submitRequest.setApplicationSubmissionContext(context);
> 
>   context.setUnmanagedAM(true);
>   context.setKeepContainersAcrossApplicationAttempts(
>       this.keepContainersAcrossApplicationAttempts);
> 
>   // Note: the application priority is never set on the context here.
>   LOG.info("Submitting unmanaged application {}", appId);
>   this.rmClient.submitApplication(submitRequest);
> }
> {code}
> Finally, I fixed this by setting the application's priority on the 
> ApplicationSubmissionContext, using the priority returned in the home 
> cluster's response when the AppMaster registers with the home cluster's RM.
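> A hedged sketch of the described fix (the variable carrying the home
> cluster's priority is an assumption here; see YARN-10977.0001.patch for the
> real change):
> {code:java}
> // Sketch only: propagate the priority obtained from the home cluster's
> // register response into the UAM submission context before submitting.
> if (homeClusterPriority != null) {
>   context.setPriority(homeClusterPriority);   // instead of the RM default
> }
> {code}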






[jira] [Updated] (YARN-10822) Containers going from New to Scheduled transition for killed container on recovery

2021-11-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10822:
--
Labels: pull-request-available  (was: )

> Containers going from New to Scheduled transition for killed container on 
> recovery
> --
>
> Key: YARN-10822
> URL: https://issues.apache.org/jira/browse/YARN-10822
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10822.v1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from NEW to 
> LOCALIZING
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from LOCALIZING to 
> SCHEDULED
> INFO  [91] ContainerScheduler: Opportunistic container 
> container_e1171_1623422468672_2229_01_000738 will be queued at the NM.
> INFO  [127] ContainerManagerImpl: Stopping container with container Id: 
> container_e1171_1623422468672_2229_01_000738
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to 
> KILLING
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to 
> CONTAINER_CLEANEDUP_AFTER_KILL
> INFO  [91] NMAuditLogger: USER=defaultcafor1stparty OPERATION=Container 
> Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS 
> APPID=application_1623422468672_2229 
> CONTAINERID=container_e1171_1623422468672_2229_01_000738
> INFO  [91] ApplicationImpl: Removing 
> container_e1171_1623422468672_2229_01_000738 from application 
> application_1623422468672_2229
> INFO  [91] ContainersMonitorImpl: Stopping resource-monitoring for 
> container_e1171_1623422468672_2229_01_000738
> INFO  [163] NodeStatusUpdaterImpl: Removed completed containers from NM 
> context:[container_e1171_1623422468672_2229_01_000738]
> NM restart happened and recovery is attempted
>  
> INFO  [1] ContainerManagerImpl: Recovering 
> container_e1171_1623422468672_2229_01_000738 in state QUEUED with exit code 
> -1000
> INFO  [1] ApplicationImpl: Adding 
> container_e1171_1623422468672_2229_01_000738 to application 
> application_1623422468672_2229
> INFO  [89] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from NEW to 
> SCHEDULED
> INFO  [89] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to 
> KILLING
> INFO  [89] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to 
> CONTAINER_CLEANEDUP_AFTER_KILL
> Ideally, when the container was killed before the restart, recovery should 
> finish the container immediately.






[jira] [Updated] (YARN-10822) Containers going from New to Scheduled transition for killed container on recovery

2021-11-08 Thread Minni Mittal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Minni Mittal updated YARN-10822:

Summary: Containers going from New to Scheduled transition for killed 
container on recovery  (was: Containers going from New to Scheduled transition 
even though container is killed before NM restart when NM recovery is enabled)

> Containers going from New to Scheduled transition for killed container on 
> recovery
> --
>
> Key: YARN-10822
> URL: https://issues.apache.org/jira/browse/YARN-10822
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
> Attachments: YARN-10822.v1.patch
>
>
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from NEW to 
> LOCALIZING
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from LOCALIZING to 
> SCHEDULED
> INFO  [91] ContainerScheduler: Opportunistic container 
> container_e1171_1623422468672_2229_01_000738 will be queued at the NM.
> INFO  [127] ContainerManagerImpl: Stopping container with container Id: 
> container_e1171_1623422468672_2229_01_000738
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to 
> KILLING
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to 
> CONTAINER_CLEANEDUP_AFTER_KILL
> INFO  [91] NMAuditLogger: USER=defaultcafor1stparty OPERATION=Container 
> Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS 
> APPID=application_1623422468672_2229 
> CONTAINERID=container_e1171_1623422468672_2229_01_000738
> INFO  [91] ApplicationImpl: Removing 
> container_e1171_1623422468672_2229_01_000738 from application 
> application_1623422468672_2229
> INFO  [91] ContainersMonitorImpl: Stopping resource-monitoring for 
> container_e1171_1623422468672_2229_01_000738
> INFO  [163] NodeStatusUpdaterImpl: Removed completed containers from NM 
> context:[container_e1171_1623422468672_2229_01_000738]
> NM restart happened and recovery is attempted
>  
> INFO  [1] ContainerManagerImpl: Recovering 
> container_e1171_1623422468672_2229_01_000738 in state QUEUED with exit code 
> -1000
> INFO  [1] ApplicationImpl: Adding 
> container_e1171_1623422468672_2229_01_000738 to application 
> application_1623422468672_2229
> INFO  [89] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from NEW to 
> SCHEDULED
> INFO  [89] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to 
> KILLING
> INFO  [89] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to 
> CONTAINER_CLEANEDUP_AFTER_KILL
> Ideally, when the container was killed before the restart, recovery should 
> finish the container immediately.






[jira] [Commented] (YARN-10474) [JDK 12] TestAsyncDispatcher fails

2021-11-08 Thread Minni Mittal (Jira)