[jira] [Commented] (YARN-10721) YARN Service containers are restarted when RM failover
[ https://issues.apache.org/jira/browse/YARN-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314485#comment-17314485 ]

kyungwan nam commented on YARN-10721:
-

I've confirmed that the attached patch solves this problem in my cluster.
[~csingh], [~billie] I believe this issue is related to YARN-6168 and YARN-7565. Can you take a look at this issue? Thanks.

> YARN Service containers are restarted when RM failover
> --
>
> Key: YARN-10721
> URL: https://issues.apache.org/jira/browse/YARN-10721
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-10721.001.patch, YARN-10721.002.patch
>
> Our cluster has a large number of NMs.
> When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
> After that, I saw that a lot of containers were restarted.
> I think it is related to YARN-6168.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10721) YARN Service containers are restarted when RM failover
[ https://issues.apache.org/jira/browse/YARN-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kyungwan nam updated YARN-10721:
Attachment: YARN-10721.002.patch

> YARN Service containers are restarted when RM failover
> --
>
> Key: YARN-10721
> URL: https://issues.apache.org/jira/browse/YARN-10721
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-10721.001.patch, YARN-10721.002.patch
>
> Our cluster has a large number of NMs.
> When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
> After that, I saw that a lot of containers were restarted.
> I think it is related to YARN-6168.
[jira] [Updated] (YARN-10721) YARN Service containers are restarted when RM failover
[ https://issues.apache.org/jira/browse/YARN-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kyungwan nam updated YARN-10721:
Description:
Our cluster has a large number of NMs.
When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
After that, I saw that a lot of containers were restarted.
I think it is related to YARN-6168.

was:
Our cluster has a large number of NMs.
When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
After that, I saw that a lot of containers were restarted.

> YARN Service containers are restarted when RM failover
> --
>
> Key: YARN-10721
> URL: https://issues.apache.org/jira/browse/YARN-10721
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-10721.001.patch
>
> Our cluster has a large number of NMs.
> When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
> After that, I saw that a lot of containers were restarted.
> I think it is related to YARN-6168.
[jira] [Assigned] (YARN-10721) YARN Service containers are restarted when RM failover
[ https://issues.apache.org/jira/browse/YARN-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kyungwan nam reassigned YARN-10721:
---
Attachment: YARN-10721.001.patch
Assignee: kyungwan nam

> YARN Service containers are restarted when RM failover
> --
>
> Key: YARN-10721
> URL: https://issues.apache.org/jira/browse/YARN-10721
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-10721.001.patch
>
> Our cluster has a large number of NMs.
> When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
> After that, I saw that a lot of containers were restarted.
[jira] [Created] (YARN-10721) YARN Service containers are restarted when RM failover
kyungwan nam created YARN-10721:
---
Summary: YARN Service containers are restarted when RM failover
Key: YARN-10721
URL: https://issues.apache.org/jira/browse/YARN-10721
Project: Hadoop YARN
Issue Type: Bug
Reporter: kyungwan nam

Our cluster has a large number of NMs.
When the RM failed over, it took 7 minutes for most of the NMs to register with the RM.
After that, I saw that a lot of containers were restarted.
[jira] [Comment Edited] (YARN-10603) Failed to reinitialize for recovered container
[ https://issues.apache.org/jira/browse/YARN-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276087#comment-17276087 ]

kyungwan nam edited comment on YARN-10603 at 2/1/21, 6:33 AM:
--

I've attached a patch. This patch works well in our cluster.
Please review and comment. Thanks.

was (Author: kyungwan nam):
I've attached a patch. Please review and comment. Thanks.

> Failed to reinitialize for recovered container
> --
>
> Key: YARN-10603
> URL: https://issues.apache.org/jira/browse/YARN-10603
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-10603.001.patch
>
> A container reinitialization request does not work after restarting the NM.
> I found some problems, as below.
> - When a recovered container is terminated, it always exits with either CONTAINER_EXITED_WITH_FAILURE or CONTAINER_EXITED_WITH_SUCCESS.
> - The container's *recoveredStatus* is set at the time of NM recovery, and it is never changed even after the container terminates.
> As a result, a newly reinitialized container will be launched as a recovered container, which does not work.
[jira] [Updated] (YARN-10603) Failed to reinitialize for recovered container
[ https://issues.apache.org/jira/browse/YARN-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kyungwan nam updated YARN-10603:
Attachment: YARN-10603.001.patch

> Failed to reinitialize for recovered container
> --
>
> Key: YARN-10603
> URL: https://issues.apache.org/jira/browse/YARN-10603
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-10603.001.patch
>
> A container reinitialization request does not work after restarting the NM.
> I found some problems, as below.
> - When a recovered container is terminated, it always exits with either CONTAINER_EXITED_WITH_FAILURE or CONTAINER_EXITED_WITH_SUCCESS.
> - The container's *recoveredStatus* is set at the time of NM recovery, and it is never changed even after the container terminates.
> As a result, a newly reinitialized container will be launched as a recovered container, which does not work.
[jira] [Created] (YARN-10603) Failed to reinitialize for recovered container
kyungwan nam created YARN-10603:
---
Summary: Failed to reinitialize for recovered container
Key: YARN-10603
URL: https://issues.apache.org/jira/browse/YARN-10603
Project: Hadoop YARN
Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam

A container reinitialization request does not work after restarting the NM.
I found some problems, as below.
- When a recovered container is terminated, it always exits with either CONTAINER_EXITED_WITH_FAILURE or CONTAINER_EXITED_WITH_SUCCESS.
- The container's *recoveredStatus* is set at the time of NM recovery, and it is never changed even after the container terminates.
As a result, a newly reinitialized container will be launched as a recovered container, which does not work.
[jira] [Updated] (YARN-10567) Support parallelism for YARN Service
[ https://issues.apache.org/jira/browse/YARN-10567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kyungwan nam updated YARN-10567:
Attachment: YARN-10567.001.patch

> Support parallelism for YARN Service
>
> Key: YARN-10567
> URL: https://issues.apache.org/jira/browse/YARN-10567
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: kyungwan nam
> Priority: Major
> Attachments: YARN-10567.001.patch
>
> YARN Service supports job-like workloads via the "restart_policy" introduced in YARN-8080.
> But we cannot set how many containers may be launched concurrently.
> This feature would be similar to "parallelism" for Kubernetes Jobs.
> https://kubernetes.io/docs/concepts/workloads/controllers/job/
[jira] [Created] (YARN-10567) Support parallelism for YARN Service
kyungwan nam created YARN-10567:
---
Summary: Support parallelism for YARN Service
Key: YARN-10567
URL: https://issues.apache.org/jira/browse/YARN-10567
Project: Hadoop YARN
Issue Type: New Feature
Reporter: kyungwan nam

YARN Service supports job-like workloads via the "restart_policy" introduced in YARN-8080.
But we cannot set how many containers may be launched concurrently.
This feature would be similar to "parallelism" for Kubernetes Jobs.
https://kubernetes.io/docs/concepts/workloads/controllers/job/
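For context, the Kubernetes Job `parallelism` setting referenced above is configured like this (these are standard Kubernetes Job fields; the job name and image are arbitrary examples):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  parallelism: 3    # at most 3 pods run concurrently
  completions: 9    # the job finishes after 9 successful pods
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo done"]
      restartPolicy: Never
```

The proposal is to give YARN Service a comparable knob, so that a job-like component with many total containers can be throttled to a bounded number of concurrently running containers.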
[jira] [Commented] (YARN-10305) Lost system-credentials when restarting RM
[ https://issues.apache.org/jira/browse/YARN-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140308#comment-17140308 ]

kyungwan nam commented on YARN-10305:
-

Hi. [~eyang] [~prabhujoseph] I've confirmed that this problem is solved with this patch. Could you take a look at this patch? Thanks.

> Lost system-credentials when restarting RM
> --
>
> Key: YARN-10305
> URL: https://issues.apache.org/jira/browse/YARN-10305
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-10305.001.patch
>
> System-credentials were introduced in YARN-2704 to keep long-running apps working.
> I've met a situation where the system-credentials were lost when restarting the RM.
> Since then, if an app's AM is stopped, restarting the AM will fail because the NMs do not have the HDFS delegation token needed for resource localization.
> The app has a couple of delegation tokens, including the timeline-server token and the HDFS delegation token.
> When restarting, the RM will request a new HDFS delegation token for an app that was submitted long ago. (It's fixed by YARN-5098)
> But if an app has multiple delegation tokens and an exception occurs for the token processed first, the remaining tokens are not processed.
> I think that is why the system-credentials are lost.
> Here are the RM's logs at the time of restarting the RM.
> {code}
> 2020-05-19 14:25:05,712 WARN security.DelegationTokenRenewer (DelegationTokenRenewer.java:handleDTRenewerAppRecoverEvent(955)) - Unable to add the application to the delegation token renewer on recovery.
> java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, > Service: 10.1.1.1:8190, Ident: (TIMELINE_DELEGATION_TOKEN owner=test-admin, > renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, > sequenceNumber=2193, masterKeyId=340) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:503) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79) > at > org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: HTTP status [403], message > [org.apache.hadoop.security.token.SecretManager$InvalidToken: yarn tried to > renew an expired token (TIMELINE_DELEGATION_TOKEN owner=test-admin, > renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, > sequenceNumber=2193, masterKeyId=340) max expiration date: 2020-04-16 > 10:26:03,258+0900 currentTime: 2020-05-19 14:25:05,700+0900] > at > org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:166) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:319) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:235) > at > 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:437) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:247) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:227) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientRetryOpForOperateDelegationToken.run(TimelineConnector.java:431) > at > org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientConnectionRetry.retryOn(TimelineConnector.java:334) > at > org.apache.hadoop.yarn.client.api.impl.TimelineConnector.operateDelegationToken(TimelineConnector.java:218) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:250) > at > org.apache.hadoop.yarn.security.client.Tim
[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services
[ https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138226#comment-17138226 ]

kyungwan nam commented on YARN-10311:
-

Hi, I've met the same issue in YARN-9905. I wanted to separate the HDFS used for log aggregation under HDFS federation, but it doesn't work due to this issue. Thanks~

> Yarn Service should support obtaining tokens from multiple name services
>
> Key: YARN-10311
> URL: https://issues.apache.org/jira/browse/YARN-10311
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Bilwa S T
> Assignee: Bilwa S T
> Priority: Major
> Attachments: YARN-10311.001.patch, YARN-10311.002.patch
>
> Currently, YARN Service supports tokens from a single name service only. We can add a new conf called "yarn.service.hdfs-servers" to support this.
[jira] [Commented] (YARN-9905) yarn-service is failed to setup application log if app-log-dir is not default-fs
[ https://issues.apache.org/jira/browse/YARN-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136305#comment-17136305 ]

kyungwan nam commented on YARN-9905:

This looks the same as YARN-10311. Closing as duplicate.

> yarn-service is failed to setup application log if app-log-dir is not default-fs
>
> Key: YARN-9905
> URL: https://issues.apache.org/jira/browse/YARN-9905
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-9905.001.patch, YARN-9905.002.patch
>
> Currently, yarn-service takes a token from the default namenode only.
> This can cause an authentication failure under HDFS federation.
> How to reproduce:
> - a Kerberized cluster
> - multiple namespaces via HDFS federation
> - yarn.nodemanager.remote-app-log-dir is set to a namespace that is not the default-fs
> Here are the nodemanager logs at that time.
> {code:java}
> 2019-10-15 11:52:50,217 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(1122)) - Creating a new application reference for app application_1569373267731_9571
> 2019-10-15 11:52:50,217 INFO application.ApplicationImpl (ApplicationImpl.java:handle(655)) - Application application_1569373267731_9571 transitioned from NEW to INITING
> ...
> Failed on local exception: java.io.IOException: > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN, KERBEROS] > at sun.reflect.GeneratedConstructorAccessor45.newInstance(Unknown > Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515) > at org.apache.hadoop.ipc.Client.call(Client.java:1457) > at org.apache.hadoop.ipc.Client.call(Client.java:1367) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900) > at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1580) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1595) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.checkExists(LogAggregationFileController.java:396) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController$1.run(LogAggregationFileController.java:338) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.createAppDir(LogAggregationFileController.java:323) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:254) > at > org.a
[jira] [Created] (YARN-10305) Lost system-credentials when restarting RM
kyungwan nam created YARN-10305:
---
Summary: Lost system-credentials when restarting RM
Key: YARN-10305
URL: https://issues.apache.org/jira/browse/YARN-10305
Project: Hadoop YARN
Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam

System-credentials were introduced in YARN-2704 to keep long-running apps working.
I've met a situation where the system-credentials were lost when restarting the RM.
Since then, if an app's AM is stopped, restarting the AM will fail because the NMs do not have the HDFS delegation token needed for resource localization.
The app has a couple of delegation tokens, including the timeline-server token and the HDFS delegation token.
When restarting, the RM will request a new HDFS delegation token for an app that was submitted long ago. (It's fixed by YARN-5098)
But if an app has multiple delegation tokens and an exception occurs for the token processed first, the remaining tokens are not processed.
I think that is why the system-credentials are lost.
Here are the RM's logs at the time of restarting the RM.
{code}
2020-05-19 14:25:05,712 WARN security.DelegationTokenRenewer (DelegationTokenRenewer.java:handleDTRenewerAppRecoverEvent(955)) - Unable to add the application to the delegation token renewer on recovery.
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, Service: 10.1.1.1:8190, Ident: (TIMELINE_DELEGATION_TOKEN owner=test-admin, renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, sequenceNumber=2193, masterKeyId=340) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:503) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: HTTP status [403], message [org.apache.hadoop.security.token.SecretManager$InvalidToken: yarn tried to renew an expired token (TIMELINE_DELEGATION_TOKEN owner=test-admin, renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, sequenceNumber=2193, masterKeyId=340) max expiration date: 2020-04-16 10:26:03,258+0900 currentTime: 2020-05-19 14:25:05,700+0900] at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:166) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:319) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:235) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:437) at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:247) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:227) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientRetryOpForOperateDelegationToken.run(TimelineConnector.java:431) at org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientConnectionRetry.retryOn(TimelineConnector.java:334) at org.apache.hadoop.yarn.client.api.impl.TimelineConnector.operateDelegationToken(TimelineConnector.java:218) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:250) at org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81) at org.apache.hadoop.security.token.Token.renew(Token.java:512) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:629) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:626) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422
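The failure mode described in YARN-10305 (one expired token aborting renewal of every remaining token) can be sketched independently of the actual DelegationTokenRenewer code. The following is a minimal illustration under assumed names: `renew`, `renewFailFast`, and `renewPerToken` are hypothetical, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the YARN-10305 failure mode: renewing tokens in one loop with a
// single surrounding catch means the first failure (e.g. an expired timeline
// token) prevents the later tokens (e.g. the HDFS delegation token) from
// ever being renewed. All names here are illustrative stand-ins.
public class TokenRenewalSketch {

    // Pretend renewal: tokens whose name starts with "EXPIRED" fail.
    static void renew(String token) throws Exception {
        if (token.startsWith("EXPIRED")) {
            throw new Exception("cannot renew expired token: " + token);
        }
    }

    // Buggy pattern: an exception for the first token aborts the whole loop,
    // so the remaining tokens are silently skipped.
    public static List<String> renewFailFast(List<String> tokens) {
        List<String> renewed = new ArrayList<>();
        try {
            for (String t : tokens) {
                renew(t);
                renewed.add(t);
            }
        } catch (Exception e) {
            // renewal for the whole app is given up here
        }
        return renewed;
    }

    // Safer pattern: isolate failures per token and continue with the rest.
    public static List<String> renewPerToken(List<String> tokens) {
        List<String> renewed = new ArrayList<>();
        for (String t : tokens) {
            try {
                renew(t);
                renewed.add(t);
            } catch (Exception e) {
                // log and skip only this token
            }
        }
        return renewed;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("EXPIRED_TIMELINE_TOKEN", "HDFS_DELEGATION_TOKEN");
        System.out.println("fail-fast renewed: " + renewFailFast(tokens));
        System.out.println("per-token renewed: " + renewPerToken(tokens));
    }
}
```

With the fail-fast loop nothing is renewed once the expired timeline token throws; with per-token isolation the HDFS delegation token is still renewed, which matches the behavior the reporter wants after RM restart.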
[jira] [Created] (YARN-10267) Add description, version as allocationTags for YARN Service
kyungwan nam created YARN-10267:
---
Summary: Add description, version as allocationTags for YARN Service
Key: YARN-10267
URL: https://issues.apache.org/jira/browse/YARN-10267
Project: Hadoop YARN
Issue Type: Improvement
Reporter: kyungwan nam
Assignee: kyungwan nam

The applicationTags for a YARN Service contain only the service name.
This makes it difficult to identify what kind of app it is.
It would be good if the description and version were added to the applicationTags.
[jira] [Created] (YARN-10262) Support application ACLs for YARN Service
kyungwan nam created YARN-10262:
---
Summary: Support application ACLs for YARN Service
Key: YARN-10262
URL: https://issues.apache.org/jira/browse/YARN-10262
Project: Hadoop YARN
Issue Type: Improvement
Reporter: kyungwan nam
Assignee: kyungwan nam

Currently, a user can access only their own yarn-service. There's no way to access another user's yarn-service.
This makes it difficult for users to collaborate.
Users should be able to set application ACLs for a yarn-service, like mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job for MapReduce.
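For comparison, the existing MapReduce job ACLs mentioned above are set like this (these two property names are real MapReduce settings; the user and group names are made-up examples, and an equivalent yarn-service property is exactly what this issue proposes):

```xml
<configuration>
  <!-- users/groups allowed to view the job (counters, logs, ...) -->
  <property>
    <name>mapreduce.job.acl-view-job</name>
    <value>alice,bob devteam</value>
  </property>
  <!-- users/groups allowed to modify the job (e.g. kill, change priority) -->
  <property>
    <name>mapreduce.job.acl-modify-job</name>
    <value>alice opsteam</value>
  </property>
</configuration>
```

The value format is a comma-separated user list, a space, then a comma-separated group list.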
[jira] [Commented] (YARN-10196) destroying app leaks zookeeper connection
[ https://issues.apache.org/jira/browse/YARN-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094308#comment-17094308 ]

kyungwan nam commented on YARN-10196:
-

Hi [~prabhujoseph], this definitely seems like a bug. Can you please take a look at this? The patch works well in my cluster. Thanks~

> destroying app leaks zookeeper connection
> -
>
> Key: YARN-10196
> URL: https://issues.apache.org/jira/browse/YARN-10196
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-10196.001.patch, YARN-10196.002.patch
>
> When destroying an app, the curatorClient in ServiceClient is started, but it is never closed.
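The leak pattern in YARN-10196, and the usual fix, can be sketched with a stub. `StubCuratorClient` below is a stand-in for the real CuratorFramework (which does implement Closeable); it is NOT the actual Curator API, and the method names are hypothetical.

```java
// Sketch of the YARN-10196 leak: a ZooKeeper client is started during
// destroy but never closed, so the connection stays open after the call
// returns. try-with-resources is the standard fix for Closeable clients.
public class CuratorLeakSketch {

    // Hypothetical stand-in for CuratorFramework.
    static class StubCuratorClient implements AutoCloseable {
        boolean started;
        boolean closed;
        void start() { started = true; }
        void deletePath(String path) { /* pretend to delete a znode */ }
        @Override public void close() { closed = true; }
    }

    // Leaky pattern: the client is started but never closed.
    public static StubCuratorClient destroyLeaky(String servicePath) {
        StubCuratorClient client = new StubCuratorClient();
        client.start();
        client.deletePath(servicePath);
        return client; // closed == false: the connection leaks
    }

    // Fixed pattern: try-with-resources guarantees close(), even on error.
    public static StubCuratorClient destroyClosing(String servicePath) {
        StubCuratorClient client = new StubCuratorClient();
        try (StubCuratorClient c = client) {
            c.start();
            c.deletePath(servicePath);
        }
        return client; // closed == true
    }

    public static void main(String[] args) {
        System.out.println("leaky closed:   " + destroyLeaky("/services/app").closed);
        System.out.println("closing closed: " + destroyClosing("/services/app").closed);
    }
}
```

The same shape applies to any client that implements Closeable: whoever starts it owns the obligation to close it on every exit path.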
[jira] [Updated] (YARN-10196) destroying app leaks zookeeper connection
[ https://issues.apache.org/jira/browse/YARN-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kyungwan nam updated YARN-10196:
Attachment: YARN-10196.002.patch

> destroying app leaks zookeeper connection
> -
>
> Key: YARN-10196
> URL: https://issues.apache.org/jira/browse/YARN-10196
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-10196.001.patch, YARN-10196.002.patch
>
> When destroying an app, the curatorClient in ServiceClient is started, but it is never closed.
[jira] [Created] (YARN-10206) Service stuck in the STARTED state when it has a component having no instance
kyungwan nam created YARN-10206:
---
Summary: Service stuck in the STARTED state when it has a component having no instance
Key: YARN-10206
URL: https://issues.apache.org/jira/browse/YARN-10206
Project: Hadoop YARN
Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam

* 'compb' has no instance; that is, its 'number_of_containers' is 0
* 'compb' has a dependency on 'compa'

{code}
"components": [
  {
    "name": "compa",
    "number_of_containers": 1,
    "dependencies": [ ]
  },
  {
    "name": "compb",
    "number_of_containers": 0,
    "dependencies": [ "compa" ]
  }
]
{code}

When launching the service, it gets stuck in the STARTED state.
[jira] [Created] (YARN-10203) Stuck in express_upgrading if there is any component which has no instance
kyungwan nam created YARN-10203:
---
Summary: Stuck in express_upgrading if there is any component which has no instance
Key: YARN-10203
URL: https://issues.apache.org/jira/browse/YARN-10203
Project: Hadoop YARN
Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam

I was trying the "express upgrade" introduced in YARN-8298.
https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceUpgrade.html
But the service state gets stuck in EXPRESS_UPGRADING. It happens only if there is any component that has no instance ("number_of_containers" : 0).
A component which has no instance should be excluded from the upgrade targets.
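The fix idea stated in the last line above (exclude zero-instance components from the upgrade targets) amounts to a simple filter. This is an illustrative sketch only: `Component` and `upgradeTargets` are simplified stand-ins, not the real YARN Service model classes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the YARN-10203 fix idea: when collecting components to upgrade,
// skip any component with number_of_containers == 0, since it has no
// instances to upgrade and the service would otherwise wait on it forever.
public class UpgradeTargetSketch {

    // Hypothetical, simplified component model.
    public static class Component {
        public final String name;
        public final int numberOfContainers;
        public Component(String name, int numberOfContainers) {
            this.name = name;
            this.numberOfContainers = numberOfContainers;
        }
    }

    // Returns the names of components that actually need upgrading.
    public static List<String> upgradeTargets(List<Component> components) {
        List<String> targets = new ArrayList<>();
        for (Component c : components) {
            if (c.numberOfContainers > 0) { // exclude empty components
                targets.add(c.name);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<Component> comps = List.of(
                new Component("compa", 1),
                new Component("compb", 0));
        System.out.println(upgradeTargets(comps)); // compb is skipped
    }
}
```

With this filter, an express upgrade of the service from YARN-10206's example would wait only on `compa`, and the zero-instance `compb` could not wedge the upgrade in EXPRESS_UPGRADING.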
[jira] [Commented] (YARN-10034) Allocation tags are not removed when node decommission
[ https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062421#comment-17062421 ]

kyungwan nam commented on YARN-10034:
-

[~prabhujoseph], [~adam.antal] Thank you for the review and commit!

> Allocation tags are not removed when node decommission
> --
>
> Key: YARN-10034
> URL: https://issues.apache.org/jira/browse/YARN-10034
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Fix For: 3.3.0
> Attachments: YARN-10034.001.patch, YARN-10034.002.patch, YARN-10034.003.patch
>
> When a node is decommissioned, the allocation tags attached to the node are not removed.
> I could see that the allocation tags are revived when recommissioning the node.
> The RM removes allocation tags only when the NM confirms the container releases, per YARN-8511, but a decommissioned NM no longer connects to the RM.
> Once a node is decommissioned, the allocation tags attached to it should be removed immediately.
[jira] [Updated] (YARN-10184) NPE happens in NMClient when reinitializeContainer
[ https://issues.apache.org/jira/browse/YARN-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-10184: Attachment: YARN-10184.002.patch > NPE happens in NMClient when reinitializeContainer > -- > > Key: YARN-10184 > URL: https://issues.apache.org/jira/browse/YARN-10184 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10184.001.patch, YARN-10184.002.patch > > > NPE happens in NMClient when upgrading a yarn-service app which AM has been > restarted. > Here is AM’s log at the time of the NPE. > {code} > 2020-02-20 16:43:35,962 [Container Event Dispatcher] ERROR > yarn.YarnUncaughtExceptionHandler - Thread Thread[Container Event > Dispatcher,5,main] threw an Exception. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$1.run(NMClientAsyncImpl.java:172) > 2020-02-20 16:43:36,398 [AMRM Callback Handler Thread] WARN > service.ServiceScheduler - Container > container_e58_1581930783345_1954_01_06 Completed. No component instance > exists. exitStatus=-100. diagnostics=Container released by application > {code} > NMClient keeps containers since the container has been started. > But, when restarting AM, NMClient is initialized and previous containers are > lost. > Since then, NPE will happen when reinitializeContainer is requested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10196) destroying app leaks zookeeper connection
[ https://issues.apache.org/jira/browse/YARN-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-10196: Attachment: YARN-10196.001.patch > destroying app leaks zookeeper connection > - > > Key: YARN-10196 > URL: https://issues.apache.org/jira/browse/YARN-10196 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10196.001.patch > > > when destroying app, curatorClient in ServiceClient is started. but It is > never closed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10196) destroying app leaks zookeeper connection
kyungwan nam created YARN-10196: --- Summary: destroying app leaks zookeeper connection Key: YARN-10196 URL: https://issues.apache.org/jira/browse/YARN-10196 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Assignee: kyungwan nam When destroying an app, the curatorClient in ServiceClient is started, but it is never closed.
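A leak of this shape is usually avoided by guaranteeing the client is closed on every code path, e.g. with try-with-resources. The sketch below uses a stand-in class, not the real Curator API, so it stays self-contained; in ServiceClient the equivalent fix would be closing curatorClient once the destroy completes:

```java
// Stand-in for a ZooKeeper/Curator client lifecycle. The point of the
// sketch: try-with-resources runs close() even if the destroy body
// throws, so the connection cannot leak.
public class CuratorClientSketch implements AutoCloseable {
    private boolean started;
    private boolean closed;

    public void start() { started = true; }
    @Override public void close() { closed = true; }
    public boolean isStarted() { return started; }
    public boolean isClosed() { return closed; }

    // Hypothetical destroy path: start the client, do the cleanup work,
    // and let try-with-resources close it unconditionally.
    public static CuratorClientSketch destroyApp() {
        CuratorClientSketch client = new CuratorClientSketch();
        try (CuratorClientSketch c = client) {
            c.start();
            // ... delete the service's znodes here ...
        }
        return client;
    }
}
```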
[jira] [Created] (YARN-10190) Typo in NMClientAsyncImpl
kyungwan nam created YARN-10190: --- Summary: Typo in NMClientAsyncImpl Key: YARN-10190 URL: https://issues.apache.org/jira/browse/YARN-10190 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Assignee: kyungwan nam Small typo in NMClientAsyncImpl.java * ReInitializeContainerEvevnt -> ReInitializeContainerEvent * containerLaunchContex -> containerLaunchContext -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10184) NPE happens in NMClient when reinitializeContainer
kyungwan nam created YARN-10184: --- Summary: NPE happens in NMClient when reinitializeContainer Key: YARN-10184 URL: https://issues.apache.org/jira/browse/YARN-10184 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Assignee: kyungwan nam An NPE happens in NMClient when upgrading a yarn-service app whose AM has been restarted. Here is the AM’s log at the time of the NPE. {code} 2020-02-20 16:43:35,962 [Container Event Dispatcher] ERROR yarn.YarnUncaughtExceptionHandler - Thread Thread[Container Event Dispatcher,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$1.run(NMClientAsyncImpl.java:172) 2020-02-20 16:43:36,398 [AMRM Callback Handler Thread] WARN service.ServiceScheduler - Container container_e58_1581930783345_1954_01_06 Completed. No component instance exists. exitStatus=-100. diagnostics=Container released by application {code} NMClient keeps track of containers once they have been started. But when the AM restarts, NMClient is re-initialized and the previously started containers are lost. After that, an NPE occurs whenever reinitializeContainer is requested.
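The failure mode — a restarted AM's NMClient no longer knows the containers started before the restart, so the lookup returns null and the event dispatcher dies — can be modeled with a plain map. This is a simplified sketch, not the actual NMClientAsyncImpl code; the class and method names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of NMClient's container bookkeeping. After an AM
// restart the map starts empty, so a reinitialize request for a
// pre-restart container must be guarded instead of dereferencing null.
public class ReinitGuard {
    private final Map<String, String> knownContainers = new HashMap<>();

    public void onContainerStarted(String containerId, String launchContext) {
        knownContainers.put(containerId, launchContext);
    }

    /** @return true if reinitialization can proceed, false if the container is unknown */
    public boolean reinitializeContainer(String containerId) {
        String ctx = knownContainers.get(containerId);
        if (ctx == null) {
            // surface an error to the caller/callback instead of throwing NPE
            return false;
        }
        // ... dispatch the reinitialize event with ctx here ...
        return true;
    }
}
```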
[jira] [Commented] (YARN-10034) Allocation tags are not removed when node decommission
[ https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041690#comment-17041690 ] kyungwan nam commented on YARN-10034: - [~prabhujoseph], Can you please take a look at this? Thanks > Allocation tags are not removed when node decommission > -- > > Key: YARN-10034 > URL: https://issues.apache.org/jira/browse/YARN-10034 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10034.001.patch, YARN-10034.002.patch, > YARN-10034.003.patch > > > When a node is decommissioned, allocation tags that are attached to the node > are not removed. > I could see that allocation tags are revived when recommissioning the node. > RM removes allocation tags only if NM confirms the container releases by > YARN-8511. but, decommissioned NM does not connect to RM anymore. > Once a node is decommissioned, allocation tags that attached to the node > should be removed immediately. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10119) Cannot reset the AM failure count for YARN Service
[ https://issues.apache.org/jira/browse/YARN-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041424#comment-17041424 ] kyungwan nam commented on YARN-10119: - Thanks [~prabhujoseph] for your review and commit. > Cannot reset the AM failure count for YARN Service > -- > > Key: YARN-10119 > URL: https://issues.apache.org/jira/browse/YARN-10119 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Labels: Reviewed > Fix For: 3.3.0 > > Attachments: YARN-10119.001.patch > > > Currently, YARN Service does not support to reset AM failure count, which > introduced in YARN-611 > Since the AM failure count is never reset, eventually that will reach > yarn.service.am-restart.max-attempts and the app will be stopped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10119) Cannot reset the AM failure count for YARN Service
[ https://issues.apache.org/jira/browse/YARN-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036621#comment-17036621 ] kyungwan nam commented on YARN-10119: - [~prabhujoseph], Can you please take a look at this? Thanks! > Cannot reset the AM failure count for YARN Service > -- > > Key: YARN-10119 > URL: https://issues.apache.org/jira/browse/YARN-10119 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10119.001.patch > > > Currently, YARN Service does not support to reset AM failure count, which > introduced in YARN-611 > Since the AM failure count is never reset, eventually that will reach > yarn.service.am-restart.max-attempts and the app will be stopped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035913#comment-17035913 ] kyungwan nam commented on YARN-9521: [~prabhujoseph] Thank you for your review and commit > RM failed to start due to system services > - > > Key: YARN-9521 > URL: https://issues.apache.org/jira/browse/YARN-9521 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Labels: Reviewed > Fix For: 3.3.0 > > Attachments: YARN-9521.001.patch, YARN-9521.002.patch, > YARN-9521.003.patch, YARN-9521.004.patch > > > when starting RM, listing system services directory has failed as follows. > {code} > 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory > is configured to /services > 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation > initialized to yarn (auth:SIMPLE) > 2019-04-30 17:18:25,467 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in > state STARTED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > Filesystem closed > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501) > Caused by: java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282) > at > 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > ... 13 more > {code} > it looks like due to the usage of filesystem cache. > this issue does not happen, when I add "fs.hdfs.impl.disable.cache=true" to > yarn-site -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
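The "Filesystem closed" stack trace above stems from Hadoop's FileSystem cache: callers with the same scheme/authority/UGI key share one cached instance, so one caller's close() invalidates it for everyone else, including SystemServiceManagerImpl — which is why "fs.hdfs.impl.disable.cache=true" (or obtaining a private instance, as FileSystem.newInstance does) works around it. A self-contained toy simulation of that sharing, not the real Hadoop cache:

```java
import java.util.HashMap;
import java.util.Map;

// Toy version of the FileSystem cache: get() returns the same shared
// object for the same key, so close() by one user breaks every other
// holder; newInstance() hands out an uncached private copy instead.
public class FsCacheSketch {
    public static class Fs {
        private boolean open = true;
        public void close() { open = false; }
        public boolean isOpen() { return open; }
    }

    private final Map<String, Fs> cache = new HashMap<>();

    // Cached and shared, like FileSystem.get()
    public Fs get(String key) {
        return cache.computeIfAbsent(key, k -> new Fs());
    }

    // Uncached private copy, like FileSystem.newInstance(); the key is
    // accepted but deliberately unused in this toy model
    public Fs newInstance(String key) {
        return new Fs();
    }
}
```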
[jira] [Updated] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9521: --- Attachment: YARN-9521.004.patch > RM failed to start due to system services > - > > Key: YARN-9521 > URL: https://issues.apache.org/jira/browse/YARN-9521 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9521.001.patch, YARN-9521.002.patch, > YARN-9521.003.patch, YARN-9521.004.patch > > > when starting RM, listing system services directory has failed as follows. > {code} > 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory > is configured to /services > 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation > initialized to yarn (auth:SIMPLE) > 2019-04-30 17:18:25,467 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in > state STARTED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > Filesystem closed > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501) > Caused by: java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126) > at > 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > ... 13 more > {code} > it looks like due to the usage of filesystem cache. > this issue does not happen, when I add "fs.hdfs.impl.disable.cache=true" to > yarn-site -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035021#comment-17035021 ] kyungwan nam commented on YARN-9521: Attaches a new patch including test code. [~eyang], [~prabhujoseph] Could you take a look it when you are available? Thanks! > RM failed to start due to system services > - > > Key: YARN-9521 > URL: https://issues.apache.org/jira/browse/YARN-9521 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9521.001.patch, YARN-9521.002.patch, > YARN-9521.003.patch > > > when starting RM, listing system services directory has failed as follows. > {code} > 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory > is configured to /services > 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation > initialized to yarn (auth:SIMPLE) > 2019-04-30 17:18:25,467 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in > state STARTED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > Filesystem closed > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501) > Caused by: java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282) > at > 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > ... 13 more > {code} > it looks like due to the usage of filesystem cache. > this issue does not happen, when I add "fs.hdfs.impl.disable.cache=true" to > yarn-site -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9521: --- Attachment: YARN-9521.003.patch > RM failed to start due to system services > - > > Key: YARN-9521 > URL: https://issues.apache.org/jira/browse/YARN-9521 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9521.001.patch, YARN-9521.002.patch, > YARN-9521.003.patch > > > when starting RM, listing system services directory has failed as follows. > {code} > 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory > is configured to /services > 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation > initialized to yarn (auth:SIMPLE) > 2019-04-30 17:18:25,467 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in > state STARTED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > Filesystem closed > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265) > at 
java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501) > Caused by: java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > ... 
13 more > {code} > it looks like due to the usage of filesystem cache. > this issue does not happen, when I add "fs.hdfs.impl.disable.cache=true" to > yarn-site -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10113) SystemServiceManagerImpl fails to initialize
[ https://issues.apache.org/jira/browse/YARN-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034114#comment-17034114 ] kyungwan nam commented on YARN-10113: - Hi. [~prabhujoseph], [~eyang]. I believe this is the same as YARN-9521. The FileSystem object for RM login user can be closed by ApiServiceClient.actionCleanUp. the patch in YARN-9521 is to perform ApiServiceClient.actionCleanUp inside ugi.doAs(). It works well in my cluster (Hadoop-3.1.2) Please let me know if I'm wrong. Thanks! > SystemServiceManagerImpl fails to initialize > - > > Key: YARN-10113 > URL: https://issues.apache.org/jira/browse/YARN-10113 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10113-001.patch, YARN-10113-002.patch > > > RM fails to start with SystemServiceManagerImpl failed to initialize. > {code} > 2020-01-28 12:20:43,631 WARN ha.ActiveStandbyElector > (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the > winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:636) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > java.io.IOException: Filesystem closed > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:881) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1257) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1298) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1294) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1294) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320) > ... 
5 more > Caused by: java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:475) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1645) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1219) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1235) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1202) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1181) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1177) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1189) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManag
[jira] [Created] (YARN-10119) Cannot reset the AM failure count for YARN Service
kyungwan nam created YARN-10119: --- Summary: Cannot reset the AM failure count for YARN Service Key: YARN-10119 URL: https://issues.apache.org/jira/browse/YARN-10119 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.2 Reporter: kyungwan nam Assignee: kyungwan nam Currently, YARN Service does not support resetting the AM failure count, a mechanism introduced in YARN-611. Since the AM failure count is never reset, it will eventually reach yarn.service.am-restart.max-attempts and the app will be stopped.
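YARN-611's reset mechanism is a sliding validity window (exposed on ApplicationSubmissionContext as setAttemptFailuresValidityInterval): only AM attempt failures newer than the interval count toward max-attempts. A self-contained sketch of that counting rule — the timestamps and 60-second interval are illustrative, and the YARN Service wiring is exactly what this issue asks to add:

```java
import java.util.ArrayList;
import java.util.List;

// Only failures inside the validity window count toward max-attempts;
// older failures "age out", which effectively resets the count over time.
public class FailureWindow {
    private final long validityIntervalMs;
    private final List<Long> failureTimes = new ArrayList<>();

    public FailureWindow(long validityIntervalMs) {
        this.validityIntervalMs = validityIntervalMs;
    }

    public void recordFailure(long nowMs) { failureTimes.add(nowMs); }

    public long countRecentFailures(long nowMs) {
        return failureTimes.stream()
                .filter(t -> nowMs - t <= validityIntervalMs)
                .count();
    }

    public boolean shouldStop(long nowMs, int maxAttempts) {
        return countRecentFailures(nowMs) >= maxAttempts;
    }
}
```

A failure at t=0 no longer counts at t=120s with a 60s window, so a long-running service AM is not stopped for failures spread far apart in time.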
[jira] [Commented] (YARN-10034) Allocation tags are not removed when node decommission
[ https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006521#comment-17006521 ] kyungwan nam commented on YARN-10034: - fixes checkstyle. I don't think test failure is related to this issue. [~cheersyang] Sorry for bothering you. Could you review this? > Allocation tags are not removed when node decommission > -- > > Key: YARN-10034 > URL: https://issues.apache.org/jira/browse/YARN-10034 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10034.001.patch, YARN-10034.002.patch, > YARN-10034.003.patch > > > When a node is decommissioned, allocation tags that are attached to the node > are not removed. > I could see that allocation tags are revived when recommissioning the node. > RM removes allocation tags only if NM confirms the container releases by > YARN-8511. but, decommissioned NM does not connect to RM anymore. > Once a node is decommissioned, allocation tags that attached to the node > should be removed immediately. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10034) Allocation tags are not removed when node decommission
[ https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-10034: Attachment: YARN-10034.003.patch > Allocation tags are not removed when node decommission > -- > > Key: YARN-10034 > URL: https://issues.apache.org/jira/browse/YARN-10034 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10034.001.patch, YARN-10034.002.patch, > YARN-10034.003.patch > > > When a node is decommissioned, allocation tags that are attached to the node > are not removed. > I could see that allocation tags are revived when recommissioning the node. > RM removes allocation tags only if NM confirms the container releases by > YARN-8511. but, decommissioned NM does not connect to RM anymore. > Once a node is decommissioned, allocation tags that attached to the node > should be removed immediately. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10034) Allocation tags are not removed when node decommission
[ https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005923#comment-17005923 ] kyungwan nam commented on YARN-10034: - Attached a new patch including test code. > Allocation tags are not removed when node decommission > -- > > Key: YARN-10034 > URL: https://issues.apache.org/jira/browse/YARN-10034 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10034.001.patch, YARN-10034.002.patch > > > When a node is decommissioned, allocation tags that are attached to the node > are not removed. > I could see that allocation tags are revived when recommissioning the node. > RM removes allocation tags only if NM confirms the container releases by > YARN-8511. but, decommissioned NM does not connect to RM anymore. > Once a node is decommissioned, allocation tags that attached to the node > should be removed immediately.
[jira] [Updated] (YARN-10034) Allocation tags are not removed when node decommission
[ https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-10034: Attachment: YARN-10034.002.patch > Allocation tags are not removed when node decommission > -- > > Key: YARN-10034 > URL: https://issues.apache.org/jira/browse/YARN-10034 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10034.001.patch, YARN-10034.002.patch > > > When a node is decommissioned, allocation tags that are attached to the node > are not removed. > I could see that allocation tags are revived when recommissioning the node. > RM removes allocation tags only if NM confirms the container releases by > YARN-8511. but, decommissioned NM does not connect to RM anymore. > Once a node is decommissioned, allocation tags that attached to the node > should be removed immediately. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10034) Allocation tags are not removed when node decommission
[ https://issues.apache.org/jira/browse/YARN-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam reassigned YARN-10034: --- Attachment: YARN-10034.001.patch Assignee: kyungwan nam Attached a patch. Please review or comment. Thanks. > Allocation tags are not removed when node decommission > -- > > Key: YARN-10034 > URL: https://issues.apache.org/jira/browse/YARN-10034 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10034.001.patch > > > When a node is decommissioned, allocation tags that are attached to the node > are not removed. > I could see that allocation tags are revived when recommissioning the node. > RM removes allocation tags only if NM confirms the container releases by > YARN-8511. but, decommissioned NM does not connect to RM anymore. > Once a node is decommissioned, allocation tags that attached to the node > should be removed immediately.
[jira] [Created] (YARN-10034) Allocation tags are not removed when node decommission
kyungwan nam created YARN-10034: --- Summary: Allocation tags are not removed when node decommission Key: YARN-10034 URL: https://issues.apache.org/jira/browse/YARN-10034 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam When a node is decommissioned, allocation tags that are attached to the node are not removed. I could see that the allocation tags are revived when the node is recommissioned. The RM removes allocation tags only after the NM confirms the container releases (YARN-8511), but a decommissioned NM no longer connects to the RM. Once a node is decommissioned, allocation tags attached to that node should be removed immediately.
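The fix direction can be illustrated with a toy sketch (not the real AllocationTagsManager; class and method names here are invented): when the RM learns a node was decommissioned, it drops that node's tags directly instead of waiting for an NM confirmation that will never arrive.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Toy model of per-node allocation tags. In the real RM, tags are normally
 * removed only after the NM confirms the container release (YARN-8511); a
 * decommissioned NM never reports back, so the node-removal path has to
 * clean up directly.
 */
public class TagStore {
  private final Map<String, Set<String>> tagsByNode = new HashMap<>();

  public void addTag(String nodeId, String tag) {
    tagsByNode.computeIfAbsent(nodeId, n -> new HashSet<>()).add(tag);
  }

  public Set<String> tagsOn(String nodeId) {
    return tagsByNode.getOrDefault(nodeId, Set.of());
  }

  /** Hook for the node-removal path (e.g. when a node is DECOMMISSIONED). */
  public void onNodeRemoved(String nodeId) {
    // Drop every tag attached to the node immediately, so stale tags are
    // not "revived" if the same node is later recommissioned.
    tagsByNode.remove(nodeId);
  }

  public static void main(String[] args) {
    TagStore store = new TagStore();
    store.addTag("nm1:45454", "hbase-master");
    store.onNodeRemoved("nm1:45454");
    System.out.println(store.tagsOn("nm1:45454").isEmpty()); // true
  }
}
```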
[jira] [Assigned] (YARN-10021) NPE in YARN Registry DNS when wrong DNS message is incoming
[ https://issues.apache.org/jira/browse/YARN-10021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam reassigned YARN-10021: --- Attachment: YARN-10021.001.patch Assignee: kyungwan nam > NPE in YARN Registry DNS when wrong DNS message is incoming > --- > > Key: YARN-10021 > URL: https://issues.apache.org/jira/browse/YARN-10021 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-10021.001.patch > > > I’ve met NPE in YARN Registry DNS as below. > It looks like this happens if the incoming DNS request is the wrong format. > {code:java} > 2019-11-29 10:51:12,178 ERROR dns.RegistryDNS (RegistryDNS.java:call(932)) - > Error initializing DNS UDP listener > java.lang.NullPointerException > at java.nio.ByteBuffer.put(ByteBuffer.java:859) > at > org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983) > at > org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121) > at > org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930) > at > org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2019-11-29 10:51:12,180 WARN concurrent.ExecutorHelper > (ExecutorHelper.java:logThrowableFromAfterExecute(50)) - Execution exception > when running task in RegistryDNS 1 > 2019-11-29 10:51:12,180 WARN concurrent.ExecutorHelper > (ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in > thread RegistryDNS 1: > java.lang.NullPointerException > at java.nio.ByteBuffer.put(ByteBuffer.java:859) > at > org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983) > at > 
org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121) > at > org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930) > at > org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10021) NPE in YARN Registry DNS when wrong DNS message is incoming
kyungwan nam created YARN-10021: --- Summary: NPE in YARN Registry DNS when wrong DNS message is incoming Key: YARN-10021 URL: https://issues.apache.org/jira/browse/YARN-10021 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam I’ve met NPE in YARN Registry DNS as below. It looks like this happens if the incoming DNS request is the wrong format. {code:java} 2019-11-29 10:51:12,178 ERROR dns.RegistryDNS (RegistryDNS.java:call(932)) - Error initializing DNS UDP listener java.lang.NullPointerException at java.nio.ByteBuffer.put(ByteBuffer.java:859) at org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983) at org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121) at org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930) at org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2019-11-29 10:51:12,180 WARN concurrent.ExecutorHelper (ExecutorHelper.java:logThrowableFromAfterExecute(50)) - Execution exception when running task in RegistryDNS 1 2019-11-29 10:51:12,180 WARN concurrent.ExecutorHelper (ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in thread RegistryDNS 1: java.lang.NullPointerException at java.nio.ByteBuffer.put(ByteBuffer.java:859) at org.apache.hadoop.registry.server.dns.RegistryDNS.serveNIOUDP(RegistryDNS.java:983) at org.apache.hadoop.registry.server.dns.RegistryDNS.access$100(RegistryDNS.java:121) at org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:930) at org.apache.hadoop.registry.server.dns.RegistryDNS$5.call(RegistryDNS.java:926) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
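The stack trace shows the NPE escaping from ByteBuffer.put inside the UDP serving loop, which kills the listener task. A defensive pattern (a toy sketch, not the RegistryDNS code; buildReply is a stand-in for the real DNS processing) is to treat a null reply as "drop the packet" before it can ever reach ByteBuffer.put:

```java
import java.nio.ByteBuffer;

/**
 * Toy sketch of the NPE guard: if a UDP DNS packet cannot be parsed, the
 * reply bytes come back null, and ByteBuffer.put(null) would throw a
 * NullPointerException that kills the listener thread. Checking for null
 * first lets the server drop the bad packet and keep serving.
 */
public class UdpReplyGuard {
  /** Pretend parser: returns null for malformed input, as a real one might. */
  static byte[] buildReply(byte[] request) {
    if (request == null || request.length < 12) { // DNS header is 12 bytes
      return null; // malformed query -- nothing to send back
    }
    return request; // echo, standing in for a real DNS answer
  }

  /** Returns true if a reply was written, false if the packet was dropped. */
  static boolean writeReply(ByteBuffer out, byte[] request) {
    byte[] reply = buildReply(request);
    if (reply == null) {
      return false; // drop silently instead of NPE-ing the listener
    }
    out.put(reply);
    return true;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(512);
    System.out.println(writeReply(buf, new byte[3]));  // malformed -> false
    System.out.println(writeReply(buf, new byte[12])); // valid size -> true
  }
}
```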
[jira] [Commented] (YARN-9986) signalToContainer REST API does not work even if requested by the app owner
[ https://issues.apache.org/jira/browse/YARN-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977263#comment-16977263 ] kyungwan nam commented on YARN-9986: [~prabhujoseph], thank you for your comment. I've attached a new patch with the modified test code. > signalToContainer REST API does not work even if requested by the app owner > --- > > Key: YARN-9986 > URL: https://issues.apache.org/jira/browse/YARN-9986 > Project: Hadoop YARN > Issue Type: Bug > Components: restapi >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9986.001.patch, YARN-9986.002.patch > > > signalToContainer REST API introduced in YARN-8693 does not work even if > requested by the app owner. > It works well only if requested by an admin user > {code} > $ kinit kwnam > Password for kw...@test.org: > $ curl -H 'Content-Type: application/json' --negotiate -u : -X POST > https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN > {"RemoteException":{"exception":"ForbiddenException","message":"java.lang.Exception: > Only admins can carry out this > operation.","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}}$ > $ kinit admin > Password for ad...@test.org: > $ > $ curl -H 'Content-Type: application/json' --negotiate -u : -X POST > https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN > $ > {code} > in contrast, the app owner can do it using the command line as below. 
> {code} > $ kinit kwnam > Password for kw...@test.org: > $ yarn container -signal container_e58_1573625560605_29927_01_02 > GRACEFUL_SHUTDOWN > Signalling container container_e58_1573625560605_29927_01_02 > 2019-11-19 09:12:29,797 INFO impl.YarnClientImpl: Signalling container > container_e58_1573625560605_29927_01_02 with command GRACEFUL_SHUTDOWN > 2019-11-19 09:12:29,920 INFO client.ConfiguredRMFailoverProxyProvider: > Failing over to rm2 > $ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9986) signalToContainer REST API does not work even if requested by the app owner
[ https://issues.apache.org/jira/browse/YARN-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9986: --- Attachment: YARN-9986.002.patch > signalToContainer REST API does not work even if requested by the app owner > --- > > Key: YARN-9986 > URL: https://issues.apache.org/jira/browse/YARN-9986 > Project: Hadoop YARN > Issue Type: Bug > Components: restapi >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9986.001.patch, YARN-9986.002.patch > > > signalToContainer REST API introduced in YARN-8693 does not work even if > requested by the app owner. > It works well only if requested by an admin user > {code} > $ kinit kwnam > Password for kw...@test.org: > $ curl -H 'Content-Type: application/json' --negotiate -u : -X POST > https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN > {"RemoteException":{"exception":"ForbiddenException","message":"java.lang.Exception: > Only admins can carry out this > operation.","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}}$ > $ kinit admin > Password for ad...@test.org: > $ > $ curl -H 'Content-Type: application/json' --negotiate -u : -X POST > https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN > $ > {code} > in contrast, the app owner can do it using the command line as below. 
> {code} > $ kinit kwnam > Password for kw...@test.org: > $ yarn container -signal container_e58_1573625560605_29927_01_02 > GRACEFUL_SHUTDOWN > Signalling container container_e58_1573625560605_29927_01_02 > 2019-11-19 09:12:29,797 INFO impl.YarnClientImpl: Signalling container > container_e58_1573625560605_29927_01_02 with command GRACEFUL_SHUTDOWN > 2019-11-19 09:12:29,920 INFO client.ConfiguredRMFailoverProxyProvider: > Failing over to rm2 > $ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9986) signalToContainer REST API does not work even if requested by the app owner
kyungwan nam created YARN-9986: -- Summary: signalToContainer REST API does not work even if requested by the app owner Key: YARN-9986 URL: https://issues.apache.org/jira/browse/YARN-9986 Project: Hadoop YARN Issue Type: Bug Components: restapi Reporter: kyungwan nam Assignee: kyungwan nam signalToContainer REST API introduced in YARN-8693 does not work even if requested by the app owner. It works well only if requested by an admin user {code} $ kinit kwnam Password for kw...@test.org: $ curl -H 'Content-Type: application/json' --negotiate -u : -X POST https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN {"RemoteException":{"exception":"ForbiddenException","message":"java.lang.Exception: Only admins can carry out this operation.","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}}$ $ kinit admin Password for ad...@test.org: $ $ curl -H 'Content-Type: application/json' --negotiate -u : -X POST https://rm002.test.org:8088/ws/v1/cluster/containers/container_e58_1573625560605_29927_01_01/signal/GRACEFUL_SHUTDOWN $ {code} in contrast, the app owner can do it using the command line as below. {code} $ kinit kwnam Password for kw...@test.org: $ yarn container -signal container_e58_1573625560605_29927_01_02 GRACEFUL_SHUTDOWN Signalling container container_e58_1573625560605_29927_01_02 2019-11-19 09:12:29,797 INFO impl.YarnClientImpl: Signalling container container_e58_1573625560605_29927_01_02 with command GRACEFUL_SHUTDOWN 2019-11-19 09:12:29,920 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 $ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
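The behavior difference suggests the REST path consults only the admin ACL, while the RPC path also honors the application owner. A toy sketch of the intended rule (illustrative names, not the actual RMWebServices code):

```java
/**
 * Illustrative access rule for signalToContainer: the request should be
 * allowed for cluster admins AND for the owner of the application the
 * container belongs to. Before the fix, only the admin check ran on the
 * REST path, which is why owners got ForbiddenException.
 */
public class SignalAcl {
  public static boolean canSignal(String caller, String appOwner,
      boolean callerIsAdmin) {
    // The owner check is the piece missing from the REST path.
    return callerIsAdmin || caller.equals(appOwner);
  }

  public static void main(String[] args) {
    System.out.println(canSignal("kwnam", "kwnam", false)); // owner -> allowed
    System.out.println(canSignal("admin", "kwnam", true));  // admin -> allowed
    System.out.println(canSignal("other", "kwnam", false)); // else  -> denied
  }
}
```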
[jira] [Assigned] (YARN-9953) YARN Service dependency should be configurable for each app
[ https://issues.apache.org/jira/browse/YARN-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam reassigned YARN-9953: -- Attachment: YARN-9953.001.patch Assignee: kyungwan nam With this patch, yarn.service.framework.path can be set in the yarnfile. If it is not present in the yarnfile, the value configured in the RM is used as before. > YARN Service dependency should be configurable for each app > --- > > Key: YARN-9953 > URL: https://issues.apache.org/jira/browse/YARN-9953 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9953.001.patch > > > Currently, YARN Service dependency can be set as yarn.service.framework.path. > But, It works only as configured in RM. > This makes it impossible for the user to choose their YARN Service dependency. > It should be configurable for each app.
[jira] [Updated] (YARN-9953) YARN Service dependency should be configurable for each app
[ https://issues.apache.org/jira/browse/YARN-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9953: --- Affects Version/s: 3.1.2 > YARN Service dependency should be configurable for each app > --- > > Key: YARN-9953 > URL: https://issues.apache.org/jira/browse/YARN-9953 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Priority: Major > > Currently, YARN Service dependency can be set as yarn.service.framework.path. > But, It works only as configured in RM. > This makes it impossible for the user to choose their YARN Service dependency. > It should be configurable for each app. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9953) YARN Service dependency should be configurable for each app
kyungwan nam created YARN-9953: -- Summary: YARN Service dependency should be configurable for each app Key: YARN-9953 URL: https://issues.apache.org/jira/browse/YARN-9953 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Currently, the YARN Service dependency can be set via yarn.service.framework.path, but it only works as configured in the RM. This makes it impossible for users to choose their own YARN Service dependency. It should be configurable for each app.
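With the proposed change, the dependency tarball location could be supplied per app in the yarnfile. A hypothetical example; the placement under configuration.properties and the HDFS path are assumptions for illustration, not the patch's confirmed syntax:

```json
{
  "name": "my-service",
  "version": "1.0.0",
  "configuration": {
    "properties": {
      "yarn.service.framework.path": "/user/kwnam/deps/service-dep.tar.gz"
    }
  },
  "components": [
    {
      "name": "worker",
      "number_of_containers": 1,
      "launch_command": "sleep 3600",
      "resource": { "cpus": 1, "memory": "256" }
    }
  ]
}
```

If the property is absent, the RM-configured default would apply, matching the fallback behavior described in the comment above.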
[jira] [Commented] (YARN-9929) NodeManager OOM because of stuck DeletionService
[ https://issues.apache.org/jira/browse/YARN-9929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957039#comment-16957039 ] kyungwan nam commented on YARN-9929: Attached a patch, which sets a timeout for _ShellCommandExecutor_. Any comments and suggestions are welcome. > NodeManager OOM because of stuck DeletionService > > > Key: YARN-9929 > URL: https://issues.apache.org/jira/browse/YARN-9929 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9929.001.patch, nm_heapdump.png > > > NMs go through frequent Full GC due to a lack of heap memory. > we can find a lot of FileDeletionTask, DockerContainerDeletionTask from the > heap dump (screenshot is attached) > and after analyzing the thread dump, we can figure out _DeletionService_ gets > stuck in _executeStatusCommand_ which run 'docker inspect' > {code:java} > "DeletionService #0" - Thread t@41 >java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > - locked <3e45c938> (a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.read1(BufferedReader.java:212) > at java.io.BufferedReader.read(BufferedReader.java:286) > - locked <3e45c938> (a java.io.InputStreamReader) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240) > at 
org.apache.hadoop.util.Shell.runCommand(Shell.java:995) > at org.apache.hadoop.util.Shell.run(Shell.java:902) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) > {code} > also, we found 'docker inspect' processes are running for a long time as > follows. > {code:java} > root 95637 0.0 0.0 2650984 35776 ? 
Sl Aug23 5:48 > /usr/bin/docker inspect --format={{.State.Status}} > container_e30_1555419799458_0014_01_30 > root 95638 0.0 0.0 2773860 33908 ? Sl Aug23 5:33 > /usr/bin/docker inspect --format={{.State.Status}} > container_e50_1561100493387_25316_01_001455 > root 95641 0.0 0.0 2445924 34204 ? Sl Aug23 5:34 > /usr/bin/docker inspect --format={{.State.Status}} > container_e49_1560851258686_2107_01_24 > root 95643 0.0 0.0 2642532 34428 ? Sl Aug23 5:30 > /usr/bin/docker inspect --format={{.State.Status}} > container_e50_1561100493387_8111_01_
[jira] [Updated] (YARN-9929) NodeManager OOM because of stuck DeletionService
[ https://issues.apache.org/jira/browse/YARN-9929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9929: --- Attachment: YARN-9929.001.patch > NodeManager OOM because of stuck DeletionService > > > Key: YARN-9929 > URL: https://issues.apache.org/jira/browse/YARN-9929 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9929.001.patch, nm_heapdump.png > > > NMs go through frequent Full GC due to a lack of heap memory. > we can find a lot of FileDeletionTask, DockerContainerDeletionTask from the > heap dump (screenshot is attached) > and after analyzing the thread dump, we can figure out _DeletionService_ gets > stuck in _executeStatusCommand_ which run 'docker inspect' > {code:java} > "DeletionService #0" - Thread t@41 >java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > - locked <3e45c938> (a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.read1(BufferedReader.java:212) > at java.io.BufferedReader.read(BufferedReader.java:286) > - locked <3e45c938> (a java.io.InputStreamReader) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:995) > at org.apache.hadoop.util.Shell.run(Shell.java:902) > at > 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) > {code} > also, we found 'docker inspect' processes are running for a long time as > follows. > {code:java} > root 95637 0.0 0.0 2650984 35776 ? 
Sl Aug23 5:48 > /usr/bin/docker inspect --format={{.State.Status}} > container_e30_1555419799458_0014_01_30 > root 95638 0.0 0.0 2773860 33908 ? Sl Aug23 5:33 > /usr/bin/docker inspect --format={{.State.Status}} > container_e50_1561100493387_25316_01_001455 > root 95641 0.0 0.0 2445924 34204 ? Sl Aug23 5:34 > /usr/bin/docker inspect --format={{.State.Status}} > container_e49_1560851258686_2107_01_24 > root 95643 0.0 0.0 2642532 34428 ? Sl Aug23 5:30 > /usr/bin/docker inspect --format={{.State.Status}} > container_e50_1561100493387_8111_01_002657{code} > > I think It has occurred since docker daemon is restarted. > 'docker inspect' which was run while restarting th
[jira] [Updated] (YARN-9929) NodeManager OOM because of stuck DeletionService
[ https://issues.apache.org/jira/browse/YARN-9929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9929: --- Attachment: nm_heapdump.png > NodeManager OOM because of stuck DeletionService > > > Key: YARN-9929 > URL: https://issues.apache.org/jira/browse/YARN-9929 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: nm_heapdump.png > > > NMs go through frequent Full GC due to a lack of heap memory. > we can find a lot of FileDeletionTask, DockerContainerDeletionTask from the > heap dump (screenshot is attached) > and after analyzing the thread dump, we can figure out _DeletionService_ gets > stuck in _executeStatusCommand_ which run 'docker inspect' > {code:java} > "DeletionService #0" - Thread t@41 >java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > - locked <3e45c938> (a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.read1(BufferedReader.java:212) > at java.io.BufferedReader.read(BufferedReader.java:286) > - locked <3e45c938> (a java.io.InputStreamReader) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:995) > at org.apache.hadoop.util.Shell.run(Shell.java:902) > at > 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) > {code} > also, we found 'docker inspect' processes are running for a long time as > follows. > {code:java} > root 95637 0.0 0.0 2650984 35776 ? 
Sl Aug23 5:48 > /usr/bin/docker inspect --format={{.State.Status}} > container_e30_1555419799458_0014_01_30 > root 95638 0.0 0.0 2773860 33908 ? Sl Aug23 5:33 > /usr/bin/docker inspect --format={{.State.Status}} > container_e50_1561100493387_25316_01_001455 > root 95641 0.0 0.0 2445924 34204 ? Sl Aug23 5:34 > /usr/bin/docker inspect --format={{.State.Status}} > container_e49_1560851258686_2107_01_24 > root 95643 0.0 0.0 2642532 34428 ? Sl Aug23 5:30 > /usr/bin/docker inspect --format={{.State.Status}} > container_e50_1561100493387_8111_01_002657{code} > > I think It has occurred since docker daemon is restarted. > 'docker inspect' which was run while restarting the docker daemon was not
[jira] [Created] (YARN-9929) NodeManager OOM because of stuck DeletionService
kyungwan nam created YARN-9929: -- Summary: NodeManager OOM because of stuck DeletionService Key: YARN-9929 URL: https://issues.apache.org/jira/browse/YARN-9929 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.2 Reporter: kyungwan nam Assignee: kyungwan nam NMs go through frequent Full GC due to a lack of heap memory. we can find a lot of FileDeletionTask, DockerContainerDeletionTask from the heap dump (screenshot is attached) and after analyzing the thread dump, we can figure out _DeletionService_ gets stuck in _executeStatusCommand_ which run 'docker inspect' {code:java} "DeletionService #0" - Thread t@41 java.lang.Thread.State: RUNNABLE at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:255) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream) at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) - locked <3e45c938> (a java.io.InputStreamReader) at java.io.InputStreamReader.read(InputStreamReader.java:184) at java.io.BufferedReader.fill(BufferedReader.java:161) at java.io.BufferedReader.read1(BufferedReader.java:212) at java.io.BufferedReader.read(BufferedReader.java:286) - locked <3e45c938> (a java.io.InputStreamReader) at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240) at org.apache.hadoop.util.Shell.runCommand(Shell.java:995) at org.apache.hadoop.util.Shell.run(Shell.java:902) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152) at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937) at org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) {code} also, we found 'docker inspect' processes are running for a long time as follows. {code:java} root 95637 0.0 0.0 2650984 35776 ? Sl Aug23 5:48 /usr/bin/docker inspect --format={{.State.Status}} container_e30_1555419799458_0014_01_30 root 95638 0.0 0.0 2773860 33908 ? Sl Aug23 5:33 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_25316_01_001455 root 95641 0.0 0.0 2445924 34204 ? Sl Aug23 5:34 /usr/bin/docker inspect --format={{.State.Status}} container_e49_1560851258686_2107_01_24 root 95643 0.0 0.0 2642532 34428 ? 
Sl Aug23 5:30 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_8111_01_002657{code} I think this has occurred because the docker daemon was restarted. A 'docker inspect' that was run while the docker daemon was restarting stopped working, and the process was never terminated. This can be considered a docker issue, but it could happen whenever 'docker inspect' does not respond, whether due to a docker daemon restart or a docker bug. It would be good to set a timeout for 'docker inspect' to avoid this issue.
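The proposal above — bounding how long the NodeManager waits on 'docker inspect' — can be sketched in plain Java. This is an illustrative sketch, not the actual patch; the class and method names here are hypothetical, and the real fix would go through Hadoop's Shell/PrivilegedOperationExecutor machinery:

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch: run an external command (e.g. "docker inspect ...")
// but kill the child process if it exceeds a timeout, so a hung docker
// daemon cannot wedge a DeletionService worker thread forever.
public class TimedCommand {

    /** Returns the command's exit code, or -1 if it timed out and was killed. */
    public static int runWithTimeout(long timeoutSeconds, String... command)
            throws Exception {
        Process p = new ProcessBuilder(command).start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();   // force-kill the stuck child
            p.waitFor();           // reap it so it does not linger as a zombie
            return -1;
        }
        return p.exitValue();
    }
}
```

With such a guard, a stuck `docker inspect` would fail fast instead of pinning a DeletionService thread while FileDeletionTask and DockerContainerDeletionTask objects pile up on the heap.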
[jira] [Updated] (YARN-9905) yarn-service is failed to setup application log if app-log-dir is not default-fs
[ https://issues.apache.org/jira/browse/YARN-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9905: --- Attachment: YARN-9905.002.patch > yarn-service is failed to setup application log if app-log-dir is not > default-fs > > > Key: YARN-9905 > URL: https://issues.apache.org/jira/browse/YARN-9905 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9905.001.patch, YARN-9905.002.patch > > > Currently, yarn-service obtains a token for the default namenode only, > which might cause authentication failures under HDFS federation. > How to reproduce: > - a kerberized cluster > - multiple namespaces via HDFS federation > - yarn.nodemanager.remote-app-log-dir set to a namespace that is not the > default-fs > Here are the nodemanager logs at that time: > {code:java} > 2019-10-15 11:52:50,217 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:startContainerInternal(1122)) - Creating a new > application reference for app application_1569373267731_9571 > 2019-10-15 11:52:50,217 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(655)) - Application > application_1569373267731_9571 transitioned from NEW to INITING > ... 
> Failed on local exception: java.io.IOException: > org.apache.hadoop.security.AccessControlException: Client cannot authenticate > via:[TOKEN, KERBEROS] > at sun.reflect.GeneratedConstructorAccessor45.newInstance(Unknown > Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515) > at org.apache.hadoop.ipc.Client.call(Client.java:1457) > at org.apache.hadoop.ipc.Client.call(Client.java:1367) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900) > at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1580) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1595) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.checkExists(LogAggregationFileController.java:396) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController$1.run(LogAggregationFileController.java:338) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.createAppDir(LogAggregationFileController.java:323) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:254) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggrega
[jira] [Created] (YARN-9905) yarn-service is failed to setup application log if app-log-dir is not default-fs
kyungwan nam created YARN-9905: -- Summary: yarn-service is failed to setup application log if app-log-dir is not default-fs Key: YARN-9905 URL: https://issues.apache.org/jira/browse/YARN-9905 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Assignee: kyungwan nam Currently, yarn-service takes a token of default namenode only. it might cause authentication failure under HDFS federation. how to reproduce - kerberized cluster - multiple namespaces by HDFS federation. - yarn.nodemanager.remote-app-log-dir is set to a namespace that is not default-fs here are the nodemanager logs at that time. {code:java} 2019-10-15 11:52:50,217 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(1122)) - Creating a new application reference for app application_1569373267731_9571 2019-10-15 11:52:50,217 INFO application.ApplicationImpl (ApplicationImpl.java:handle(655)) - Application application_1569373267731_9571 transitioned from NEW to INITING ... Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at sun.reflect.GeneratedConstructorAccessor45.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515) at org.apache.hadoop.ipc.Client.call(Client.java:1457) at org.apache.hadoop.ipc.Client.call(Client.java:1367) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900) at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1580) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1595) at org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.checkExists(LogAggregationFileController.java:396) at org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController$1.run(LogAggregationFileController.java:338) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.createAppDir(LogAggregationFileController.java:323) at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:254) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:204) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:347) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:69) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197) at org.apache.hadoop.yarn.ev
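The fix direction for YARN-9905 — collecting a delegation token for every namespace the service touches, not just the default filesystem — can be sketched in a standalone snippet. This is illustrative only, not the patch itself; the helper name and the use of URI authorities to distinguish HDFS federation namespaces are assumptions:

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative: determine which filesystems need delegation tokens.
// Under HDFS federation, yarn.nodemanager.remote-app-log-dir may point at
// a namespace other than fs.defaultFS; that namespace needs its own token,
// or log-aggregation setup fails with "Client cannot authenticate via:[TOKEN, KERBEROS]".
public class TokenTargets {

    public static Set<String> filesystemsNeedingTokens(URI defaultFs,
                                                       URI remoteAppLogDir) {
        Set<String> targets = new LinkedHashSet<>();
        targets.add(defaultFs.getAuthority());
        // Only a fully-qualified log dir names a (possibly distinct) namespace;
        // a bare path resolves against the default filesystem.
        if (remoteAppLogDir.getAuthority() != null) {
            targets.add(remoteAppLogDir.getAuthority());
        }
        return targets;
    }
}
```

In the real code the service would then request a delegation token from each listed filesystem rather than from the default namenode alone.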
[jira] [Commented] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero
[ https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919098#comment-16919098 ] kyungwan nam commented on YARN-9790: [~Prabhu Joseph] I've attached a new patch. It fixes the failed test case, and a test case for this issue has also been added. Thanks > Failed to set default-application-lifetime if maximum-application-lifetime is > less than or equal to zero > > > Key: YARN-9790 > URL: https://issues.apache.org/jira/browse/YARN-9790 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9790.001.patch, YARN-9790.002.patch, > YARN-9790.003.patch, YARN-9790.004.patch > > > capacity-scheduler > {code} > ... > yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1 > yarn.scheduler.capacity.root.dev.default-application-lifetime=604800 > {code} > refreshQueue was failed as follows > {code} > 2019-08-28 15:21:57,423 WARN resourcemanager.AdminService > (AdminService.java:logAndWrapException(910)) - Exception refresh queues. 
> java.io.IOException: Failed to re-init queues : Default lifetime604800 can't > exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default > lifetime604800 can't exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472) > ... 12 more > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero
[ https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9790: --- Attachment: YARN-9790.004.patch > Failed to set default-application-lifetime if maximum-application-lifetime is > less than or equal to zero > > > Key: YARN-9790 > URL: https://issues.apache.org/jira/browse/YARN-9790 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9790.001.patch, YARN-9790.002.patch, > YARN-9790.003.patch, YARN-9790.004.patch > > > capacity-scheduler > {code} > ... > yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1 > yarn.scheduler.capacity.root.dev.default-application-lifetime=604800 > {code} > refreshQueue was failed as follows > {code} > 2019-08-28 15:21:57,423 WARN resourcemanager.AdminService > (AdminService.java:logAndWrapException(910)) - Exception refresh queues. > java.io.IOException: Failed to re-init queues : Default lifetime604800 can't > exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default > lifetime604800 can't exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472) > ... 12 more > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero
[ https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9790: --- Attachment: YARN-9790.003.patch > Failed to set default-application-lifetime if maximum-application-lifetime is > less than or equal to zero > > > Key: YARN-9790 > URL: https://issues.apache.org/jira/browse/YARN-9790 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9790.001.patch, YARN-9790.002.patch, > YARN-9790.003.patch > > > capacity-scheduler > {code} > ... > yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1 > yarn.scheduler.capacity.root.dev.default-application-lifetime=604800 > {code} > refreshQueue was failed as follows > {code} > 2019-08-28 15:21:57,423 WARN resourcemanager.AdminService > (AdminService.java:logAndWrapException(910)) - Exception refresh queues. > java.io.IOException: Failed to re-init queues : Default lifetime604800 can't > exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default > lifetime604800 can't exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472) > ... 12 more > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero
[ https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9790: --- Attachment: (was: YARN-9790.003.patch) > Failed to set default-application-lifetime if maximum-application-lifetime is > less than or equal to zero > > > Key: YARN-9790 > URL: https://issues.apache.org/jira/browse/YARN-9790 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9790.001.patch, YARN-9790.002.patch, > YARN-9790.003.patch > > > capacity-scheduler > {code} > ... > yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1 > yarn.scheduler.capacity.root.dev.default-application-lifetime=604800 > {code} > refreshQueue was failed as follows > {code} > 2019-08-28 15:21:57,423 WARN resourcemanager.AdminService > (AdminService.java:logAndWrapException(910)) - Exception refresh queues. > java.io.IOException: Failed to re-init queues : Default lifetime604800 can't > exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default > lifetime604800 can't exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472) > ... 12 more > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero
[ https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918227#comment-16918227 ] kyungwan nam commented on YARN-9790: [~Prabhu Joseph] Thank you for your review and helpful comment! If maximum-lifetime is -1 or 0, it means there is no limit; therefore, default-lifetime should be allowed to take any higher value. In my opinion, it should be checked as follows. {code} - if (defaultApplicationLifetime > maxApplicationLifetime) { + if (maxApplicationLifetime > 0 && + defaultApplicationLifetime > maxApplicationLifetime) { {code} I also think a fix is needed in CapacityScheduler#checkAndGetApplicationLifetime: if no lifetime is specified for an app, it should fall back to default-lifetime, even when maximum-lifetime is -1 or 0. CapacityScheduler#checkAndGetApplicationLifetime {code} // check only for maximum, that's enough because default can't // exceed maximum if (maximumApplicationLifetime <= 0) { -return lifetimeRequestedByApp; +return (lifetimeRequestedByApp <= 0) ? defaultApplicationLifetime : +lifetimeRequestedByApp; } {code} Please let me know if you have any thoughts about this. Thanks. > Failed to set default-application-lifetime if maximum-application-lifetime is > less than or equal to zero > > > Key: YARN-9790 > URL: https://issues.apache.org/jira/browse/YARN-9790 > Project: Hadoop YARN > Issue Type: Bug >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9790.001.patch, YARN-9790.002.patch > > > capacity-scheduler > {code} > ... > yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1 > yarn.scheduler.capacity.root.dev.default-application-lifetime=604800 > {code} > refreshQueue was failed as follows > {code} > 2019-08-28 15:21:57,423 WARN resourcemanager.AdminService > (AdminService.java:logAndWrapException(910)) - Exception refresh queues. 
> java.io.IOException: Failed to re-init queues : Default lifetime604800 can't > exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default > lifetime604800 can't exceed maximum lifetime -1 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472) > ...
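The two changes proposed in the comment above can be sketched in isolation (a minimal, illustrative sketch under the comment's stated semantics; this is not the actual CapacityScheduler code, and the class and method names are made up for the example):

```java
public class LifetimeCheckSketch {
    // Assumption from the comment: a maximum-application-lifetime <= 0 means "no limit".

    // Proposed validation: only reject the default when a positive maximum is configured.
    static void validate(long defaultLifetime, long maxLifetime) {
        if (maxLifetime > 0 && defaultLifetime > maxLifetime) {
            throw new IllegalArgumentException("Default lifetime " + defaultLifetime
                + " can't exceed maximum lifetime " + maxLifetime);
        }
    }

    // Proposed resolution: with an unlimited maximum, an app that did not request
    // a lifetime should still pick up the queue's default lifetime.
    static long effectiveLifetime(long requestedByApp, long defaultLifetime, long maxLifetime) {
        if (maxLifetime <= 0) {
            return (requestedByApp <= 0) ? defaultLifetime : requestedByApp;
        }
        // With a positive maximum, cap whichever lifetime applies (illustrative).
        long lifetime = (requestedByApp <= 0) ? defaultLifetime : requestedByApp;
        return Math.min(lifetime, maxLifetime);
    }

    public static void main(String[] args) {
        validate(604800, -1);  // no longer rejected: maximum -1 means unlimited
        System.out.println(effectiveLifetime(0, 604800, -1));    // default applies
        System.out.println(effectiveLifetime(1000, 604800, -1)); // app's own request wins
    }
}
```

With this shape, the queue configuration from the report (maximum -1, default 604800) validates cleanly instead of failing refreshQueues.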
[jira] [Created] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero
kyungwan nam created YARN-9790: -- Summary: Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero Key: YARN-9790 URL: https://issues.apache.org/jira/browse/YARN-9790 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Assignee: kyungwan nam capacity-scheduler {code} ... yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1 yarn.scheduler.capacity.root.dev.default-application-lifetime=604800 {code} refreshQueue was failed as follows {code} 2019-08-28 15:21:57,423 WARN resourcemanager.AdminService (AdminService.java:logAndWrapException(910)) - Exception refresh queues. java.io.IOException: Failed to re-init queues : Default lifetime604800 can't exceed maximum lifetime -1 ... 12 more {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904875#comment-16904875 ] kyungwan nam commented on YARN-9719: [~eyang], [~Prabhu Joseph] The 007 patch passed without failures. Could you review it? Thanks. > Failed to restart yarn-service if it doesn’t exist in RM > > > Key: YARN-9719 > URL: https://issues.apache.org/jira/browse/YARN-9719 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9719.001.patch, YARN-9719.002.patch, > YARN-9719.003.patch, YARN-9719.004.patch, YARN-9719.005.patch, > YARN-9719.006.patch, YARN-9719.007.patch > > > Sometimes, restarting a yarn-service fails as follows. > {code} > {"diagnostics":"Application with id 'application_1562735362534_10461' doesn't > exist in RM. Please check that the job submission was successful.\n\tat > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)\n\tat > > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:234)\n\tat > > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)\n\tat > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)\n\tat > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)\n\tat > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)\n\tat > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)\n\tat > java.security.AccessController.doPrivileged(Native Method)\n\tat > javax.security.auth.Subject.doAs(Subject.java:422)\n\tat > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)\n\tat > org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)\n"} > {code} > It seems to occur when restarting a yarn-service that was stopped long ago. > By default, RM keeps up to 1000 completed applications > (yarn.resourcemanager.max-completed-applications). -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
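The failure mode in this thread — the RM keeps only a bounded history of completed applications, so getApplicationReport on a long-stopped service throws ApplicationNotFoundException — can be illustrated with a self-contained sketch. The RM stub, exception class, and return values below are invented for the example and are not the actual ServiceClient patch:

```java
import java.util.HashMap;
import java.util.Map;

public class RestartSketch {
    static class ApplicationNotFoundException extends RuntimeException {
        ApplicationNotFoundException(String msg) { super(msg); }
    }

    // Stand-in for the RM's bounded history of completed applications
    // (yarn.resourcemanager.max-completed-applications, 1000 by default).
    static Map<String, String> completedApps = new HashMap<>();

    static String getApplicationReport(String appId) {
        String state = completedApps.get(appId);
        if (state == null) {
            throw new ApplicationNotFoundException(
                "Application with id '" + appId + "' doesn't exist in RM.");
        }
        return state;
    }

    // Restart logic that tolerates an evicted application: if the RM no longer
    // knows the previous attempt, assume it finished long ago and submit a new
    // application instead of propagating the error to the caller.
    static String restartService(String previousAppId) {
        try {
            getApplicationReport(previousAppId); // previous attempt still known
            return "RESTARTED_FROM_KNOWN_APP";
        } catch (ApplicationNotFoundException e) {
            return "SUBMITTED_NEW_APP";
        }
    }

    public static void main(String[] args) {
        completedApps.put("application_1562735362534_10461", "FINISHED");
        System.out.println(restartService("application_1562735362534_10461"));
        // An app id evicted from the RM's history no longer fails the restart.
        System.out.println(restartService("application_evicted_long_ago"));
    }
}
```

The key design point is that ApplicationNotFoundException is treated as an expected state (the service was stopped long ago) rather than a fatal error.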
[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9719: --- Attachment: YARN-9719.007.patch
[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9719: --- Attachment: YARN-9719.006.patch
[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9719: --- Attachment: (was: YARN-9719.006.patch)
[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9719: --- Attachment: YARN-9719.006.patch
[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9719: --- Attachment: YARN-9719.005.patch
[jira] [Commented] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902951#comment-16902951 ] kyungwan nam commented on YARN-9719: Attaching a new patch, which clears the config used by the completed test.
[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9719: --- Attachment: YARN-9719.004.patch
[jira] [Commented] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900546#comment-16900546 ] kyungwan nam commented on YARN-9719: [~Prabhu Joseph], [~eyang] Thank you for your comments. I've attached a new patch including test code.
[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9719: --- Attachment: YARN-9719.003.patch
[jira] [Commented] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899711#comment-16899711 ] kyungwan nam commented on YARN-9719: [~Prabhu Joseph] Thank you for your review and comments. Attaching a new patch based on trunk.
[jira] [Updated] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
[ https://issues.apache.org/jira/browse/YARN-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9719: --- Attachment: YARN-9719.002.patch
[jira] [Created] (YARN-9719) Failed to restart yarn-service if it doesn’t exist in RM
kyungwan nam created YARN-9719: -- Summary: Failed to restart yarn-service if it doesn’t exist in RM Key: YARN-9719 URL: https://issues.apache.org/jira/browse/YARN-9719 Project: Hadoop YARN Issue Type: Bug Components: yarn-native-services Reporter: kyungwan nam Assignee: kyungwan nam
[jira] [Assigned] (YARN-9703) Failed to cancel yarn service upgrade when canceling multiple times
[ https://issues.apache.org/jira/browse/YARN-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam reassigned YARN-9703: -- Assignee: kyungwan nam Attachment: YARN-9703.001.patch I've attached a patch that fixes it. Please review or comment. Thanks.
> Failed to cancel yarn service upgrade when canceling multiple times
>
> Key: YARN-9703
> URL: https://issues.apache.org/jira/browse/YARN-9703
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn-native-services
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: YARN-9703.001.patch
>
> sleeptest.yarnfile
> {code:java}
> {
>    "name":"sleeptest",
>    "version":"1.0.0",
>    "lifetime":"-1",
>    "components":[
>       {
>          "name":"sleep",
>          "number_of_containers":3,
>          …
> }
> {code}
> How to reproduce:
> * initiate upgrade
> * upgrade instance sleep-0
> * cancel upgrade -> it succeeded without any problem
> * initiate upgrade
> * upgrade instance sleep-0
> * cancel upgrade -> it didn’t work. At that time, the AM logs are as follows.
> {code:java}
> 2019-07-26 10:12:20,057 [Component dispatcher] INFO instance.ComponentInstance - container_e72_1564103075282_0002_01_04 pending cancellation
> 2019-07-26 10:12:20,057 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-2 : container_e72_1564103075282_0002_01_04] Transitioned from READY to CANCEL_UPGRADING on CANCEL_UPGRADE event
> {code}
[jira] [Created] (YARN-9703) Failed to cancel yarn service upgrade when canceling multiple times
kyungwan nam created YARN-9703: -- Summary: Failed to cancel yarn service upgrade when canceling multiple times Key: YARN-9703 URL: https://issues.apache.org/jira/browse/YARN-9703 Project: Hadoop YARN Issue Type: Bug Components: yarn-native-services Reporter: kyungwan nam
[jira] [Updated] (YARN-9691) canceling upgrade does not work if upgrade failed container is existing
[ https://issues.apache.org/jira/browse/YARN-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9691: --- Attachment: YARN-9691.002.patch
[jira] [Created] (YARN-9691) canceling upgrade does not work if upgrade failed container is existing
kyungwan nam created YARN-9691: -- Summary: canceling upgrade does not work if upgrade failed container is existing Key: YARN-9691 URL: https://issues.apache.org/jira/browse/YARN-9691 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Assignee: kyungwan nam
If a container fails to upgrade during a yarn service upgrade, the container will be released and will transition to the FAILED_UPGRADE state. After that, I expected it to be able to go back to the previous version using cancel-upgrade, but it didn’t work. At that time, the AM log is as follows.
{code}
# failed to upgrade container_e62_1563179597798_0006_01_08
2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO service.ClientAMService - Upgrade container container_e62_1563179597798_0006_01_08
2019-07-16 18:21:55,153 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] spec state state changed from NEEDS_UPGRADE -> UPGRADING
2019-07-16 18:21:55,154 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] Transitioned from READY to UPGRADING on UPGRADE event
2019-07-16 18:21:55,154 [pool-5-thread-4] INFO registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08]: Deleting registry path /users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-08
2019-07-16 18:21:55,156 [pool-6-thread-6] INFO provider.ProviderUtils - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] version 1.0.1 : Creating dir on hdfs: hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO containerlaunch.ContainerLaunchService - reInitializing container container_e62_1563179597798_0006_01_08 with version 1.0.1
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO containerlaunch.AbstractLauncher - yarn docker env var has been set {LANGUAGE=en_US.UTF-8,
HADOOP_USER_NAME=test, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM, WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker, YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=} 2019-07-16 18:21:55,158 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER for Container container_e62_1563179597798_0006_01_08 2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] spec state state changed from UPGRADING -> RUNNING_BUT_UNREADY 2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] retrieve status after 30 2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] Transitioned from UPGRADING to REINITIALIZED on START event 2019-07-16 18:22:07,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet" 2019-07-16 18:22:37,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet" 2019-07-16 18:23:07,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 KST 2019", outcome="failure", message="Failure in Default 
probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet" 2019-07-16 18:23:08,225 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_08] spec state state changed from RUNNING_BUT_UNREADY -> FAILED_UPGRADE # request canceling upgrade 2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_04 true 2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_03 true 2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_
[jira] [Commented] (YARN-9682) Wrong log message when finalizing the upgrade
[ https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886641#comment-16886641 ] kyungwan nam commented on YARN-9682: [~cheersyang] Thank you for your review and comment.
> Wrong log message when finalizing the upgrade
>
> Key: YARN-9682
> URL: https://issues.apache.org/jira/browse/YARN-9682
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Trivial
> Fix For: 3.3.0
>
> Attachments: YARN-9682.001.patch
>
> I've seen the following wrong message when running finalize-upgrade for a yarn-service:
> {code:java}
> 2019-07-16 17:44:09,204 INFO client.ServiceClient (ServiceClient.java:actionStartAndGetId(1193)) - Finalize service {} upgrade{code}
[jira] [Assigned] (YARN-9682) wrong log message when finalize upgrade
[ https://issues.apache.org/jira/browse/YARN-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam reassigned YARN-9682: -- Assignee: kyungwan nam Attachment: YARN-9682.001.patch
[jira] [Created] (YARN-9682) wrong log message when finalize upgrade
kyungwan nam created YARN-9682: -- Summary: wrong log message when finalize upgrade Key: YARN-9682 URL: https://issues.apache.org/jira/browse/YARN-9682 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam
[jira] [Commented] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876007#comment-16876007 ] kyungwan nam commented on YARN-9521: I attached a new patch in which ApiServiceClient.actionCleanUp is performed with ugi.doAs().
> RM failed to start due to system services
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.1.2
> Reporter: kyungwan nam
> Priority: Major
> Attachments: YARN-9521.001.patch, YARN-9521.002.patch
>
> When starting the RM, listing the system services directory fails as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is configured to /services
> 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: Filesystem closed
> 	at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> 	at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> 	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> 	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1217)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1233)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1200)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> 	at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> 	at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> 	at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> 	... 13 more
> {code}
> It looks like it is due to the usage of the filesystem cache. This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to yarn-site.
[jira] [Commented] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876006#comment-16876006 ] kyungwan nam commented on YARN-9521: After some further digging, I think I have figured out the cause of this issue more precisely.
Normally, when a yarn-service API is requested, a new ugi is created and the request is performed inside ugi.doAs(). Calling FileSystem.get() inside ugi.doAs() always creates a new FileSystem, because the ugi is used as the key of the FileSystem.CACHE (YARN-3336 is helpful to understand this). So in this case, sc.close() does not close a FileSystem that other components obtained from the FileSystem.CACHE.
{code}
UserGroupInformation ugi = getProxyUser(request);
LOG.info("POST: createService = {} user = {}", service, ugi);
if (service.getState() == ServiceState.STOPPED) {
  ugi.doAs(new PrivilegedExceptionAction<Void>() {
    @Override
    public Void run() throws YarnException, IOException {
      ServiceClient sc = getServiceClient();
      try {
        sc.init(YARN_CONFIG);
        sc.start();
        sc.actionBuild(service);
      } finally {
        sc.close();
      }
      return null;
    }
  });
{code}
On the other hand, ApiServiceClient.actionCleanUp, which is called from RMAppImpl.appAdminClientCleanUp, is performed as the RM loginUser instead of inside doAs(). In this case, FileSystem.get() can return the cached instance that SystemServiceManagerImpl and FileSystemNodeLabelsStore refer to.
{code}
@Override
public int actionCleanUp(String appName, String userName) throws IOException, YarnException {
  ServiceClient sc = new ServiceClient();
  sc.init(getConfig());
  sc.start();
  int result = sc.actionCleanUp(appName, userName);
  sc.close();
  return result;
}
{code}
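The caching behavior described in the comment above can be illustrated with a self-contained sketch. FileSystem.CACHE keys entries by the UGI, and UserGroupInformation equality is identity-based, so a freshly created ugi inside doAs() always misses the cache, while code running as the long-lived loginUser shares one cached instance; closing that shared instance is what produces "Filesystem closed" elsewhere. The classes below are simplified stand-ins for illustration, not the actual Hadoop code:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of FileSystem.CACHE, keyed by the UGI instance.
public class FsCacheSketch {
    // Stand-in for UserGroupInformation: no equals/hashCode override,
    // so equality is identity-based, like the real UGI.
    static final class Ugi {
        final String user;
        Ugi(String user) { this.user = user; }
    }

    // Stand-in for FileSystem with a closed flag.
    static final class Fs {
        boolean closed = false;
        void close() { closed = true; }
    }

    static final Map<Ugi, Fs> CACHE = new HashMap<>();

    static Fs get(Ugi ugi) {
        return CACHE.computeIfAbsent(ugi, u -> new Fs());
    }

    public static void main(String[] args) {
        Ugi loginUser = new Ugi("yarn");

        // SystemServiceManagerImpl and actionCleanUp both run as the RM
        // login user, so they share one cached FileSystem instance.
        Fs fsA = get(loginUser);
        Fs fsB = get(loginUser);
        System.out.println(fsA == fsB);    // same cached instance

        // A REST request creates a fresh UGI; doAs() then obtains a new,
        // uncached FileSystem, so closing it is harmless.
        Fs fsDoAs = get(new Ugi("yarn"));
        System.out.println(fsDoAs == fsA); // distinct cache entry
        fsDoAs.close();
        System.out.println(fsA.closed);    // shared instance untouched

        // actionCleanUp closing the login-user FileSystem breaks every
        // other holder of the cached instance ("Filesystem closed").
        fsB.close();
        System.out.println(fsA.closed);
    }
}
```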
[jira] [Updated] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9521: --- Attachment: YARN-9521.002.patch > RM failed to start due to system services > - > > Key: YARN-9521 > URL: https://issues.apache.org/jira/browse/YARN-9521 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Priority: Major > Attachments: YARN-9521.001.patch, YARN-9521.002.patch > > > when starting RM, listing system services directory has failed as follows. > {code} > 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory > is configured to /services > 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation > initialized to yarn (auth:SIMPLE) > 2019-04-30 17:18:25,467 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in > state STARTED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > Filesystem closed > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265) > at java.security.AccessController.doPrivileged(Native 
Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501) > Caused by: java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > ... 13 more > {code} > it looks like due to the usage of filesystem cache. 
> this issue does not happen, when I add "fs.hdfs.impl.disable.cache=true" to > yarn-site -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
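The workaround mentioned in the report can be written as a configuration fragment. This is only a sketch of the reported mitigation (property name as given in the report; placing it in yarn-site.xml on the ResourceManager hosts is assumed), not a recommended permanent fix:

```xml
<!-- Workaround from the report: disable the HDFS FileSystem cache so each
     FileSystem.get() call returns a fresh instance instead of a shared,
     cached one that another component may close. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```

Note that disabling the cache trades the shared-handle hazard for extra FileSystem instances, so the attached patches pursue a code-level fix instead.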
[jira] [Commented] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868310#comment-16868310 ] kyungwan nam commented on YARN-9521: {code:java} 2019-06-18 18:47:38,634 INFO nodelabels.CommonNodeLabelsManager (CommonNodeLabelsManager.java:internalUpdateLabelsOnNodes(664)) - REPLACE labels on nodes: 2019-06-18 18:47:38,634 INFO nodelabels.CommonNodeLabelsManager (CommonNodeLabelsManager.java:internalUpdateLabelsOnNodes(666)) - NM=test.nm1.com:0, labels=[test] 2019-06-18 18:47:38,635 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1560841031202_0111_01 container=null queue=dev clusterResource= type=OFF_SWITCH requestedPartition= 2019-06-18 18:47:38,635 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1560841031202_0111_01 container=null queue=dev clusterResource= type=OFF_SWITCH requestedPartition= 2019-06-18 18:47:38,635 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1560841031202_0111_01 container=null queue=dev clusterResource= type=OFF_SWITCH requestedPartition= 2019-06-18 18:47:38,635 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1560841031202_0111_01 container=null queue=dev clusterResource= type=OFF_SWITCH requestedPartition= 2019-06-18 18:47:38,636 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(480)) - container_e48_1560841031202_0111_01_002020 Container Transitioned from NEW to ALLOCATED 2019-06-18 18:47:38,636 ERROR nodelabels.CommonNodeLabelsManager (CommonNodeLabelsManager.java:handleStoreEvent(201)) - 
Failed to store label modification to storage 2019-06-18 18:47:38,637 INFO fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e48_1560841031202_0111_01_002020 of capacity on host test.nm3.com:8454, which has 3 containers, used and available after allocation 2019-06-18 18:47:38,637 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Filesystem closed at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:202) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:174) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:169) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1412) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1383) at org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:427) at org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:435) at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:404) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1379) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.ensureAppendEditlogFile(FileSystemNodeLabelsStore.java:107) at 
org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.updateNodeToLabelsMappings(FileSystemNodeLabelsStore.java:118) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) ... 5 more 2019-06-18 18:47:38,637 INFO capacity.ParentQueue (ParentQueue.java:apply(1340)) - assignedContainer queue=root usedCapacity=0.08724866 absoluteUsedCapacity=0.08724866 used= cluster= 2019-06-18 18:47:38,637 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2894)) - Allocation proposal accepted 2019-06-18 18:47:38,637 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2900)) - Failed to accept allocation proposal 2019-06-18 18:47:38,637 INFO capacity.CapacityScheduler (CapacitySched
[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even though running state
[ https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16864996#comment-16864996 ] kyungwan nam commented on YARN-9386: [~billie.rinaldi], [~wangda] Sorry for bothering you... Could you please review this when you are available? Thanks :) > destroying yarn-service is allowed even though running state > > > Key: YARN-9386 > URL: https://issues.apache.org/jira/browse/YARN-9386 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9386.001.patch, YARN-9386.002.patch, > YARN-9386.003.patch > > > It looks very dangerous to destroy a running app. It should not be allowed. > {code} > [yarn-ats@test ~]$ yarn app -list > 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > Total number of applications (application-types: [], states: [SUBMITTED, > ACCEPTED, RUNNING] and tags: []):3 > Application-Id Application-NameApplication-Type > User Queue State Final-State >ProgressTracking-URL > application_1551250841677_0003fbyarn-service >ambari-qa default RUNNING UNDEFINED >100% N/A > application_1552379723611_0002 fb1yarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > application_1550801435420_0001 ats-hbaseyarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > [yarn-ats@test ~]$ yarn app -destroy fb1 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at 
test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms > 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed > service fb1 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9628) incorrect ‘number of containers’ is written when decommission for non-existing component instance
[ https://issues.apache.org/jira/browse/YARN-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam reassigned YARN-9628: -- Assignee: kyungwan nam > incorrect ‘number of containers’ is written when decommission for > non-existing component instance > - > > Key: YARN-9628 > URL: https://issues.apache.org/jira/browse/YARN-9628 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9628.001.patch > > > Decommission for component instances is introduced in YARN-8761. > Currently, decommission is succeeded even though the component instance does > not exist. > As a result, incorrect ‘number of containers’ would be written to the service > spec file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9628) incorrect ‘number of containers’ is written when decommission for non-existing component instance
[ https://issues.apache.org/jira/browse/YARN-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9628: --- Attachment: YARN-9628.001.patch > incorrect ‘number of containers’ is written when decommission for > non-existing component instance > - > > Key: YARN-9628 > URL: https://issues.apache.org/jira/browse/YARN-9628 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Priority: Major > Attachments: YARN-9628.001.patch > > > Decommission for component instances is introduced in YARN-8761. > Currently, decommission is succeeded even though the component instance does > not exist. > As a result, incorrect ‘number of containers’ would be written to the service > spec file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9628) incorrect ‘number of containers’ is written when decommission for non-existing component instance
kyungwan nam created YARN-9628: -- Summary: incorrect ‘number of containers’ is written when decommission for non-existing component instance Key: YARN-9628 URL: https://issues.apache.org/jira/browse/YARN-9628 Project: Hadoop YARN Issue Type: Bug Components: yarn-native-services Reporter: kyungwan nam Decommission for component instances was introduced in YARN-8761. Currently, decommissioning succeeds even though the component instance does not exist. As a result, an incorrect ‘number of containers’ value is written to the service spec file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
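The validation the report argues for can be sketched generically. This is a hypothetical illustration, not the actual YARN-9628 patch; the class, method, and variable names below are invented:

```java
import java.util.Set;

// Hypothetical sketch of validating a decommission request: only component
// instances that actually exist should reduce the component's
// 'number of containers'; a request for a non-existent instance is rejected
// instead of silently shrinking the count written to the service spec.
public class DecommissionCheck {
    static int decommission(Set<String> liveInstances, String instance,
                            int numberOfContainers) {
        if (!liveInstances.contains(instance)) {
            throw new IllegalArgumentException(
                "component instance does not exist: " + instance);
        }
        liveInstances.remove(instance);
        return numberOfContainers - 1;
    }
}
```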
[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even though running state
[ https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856300#comment-16856300 ] kyungwan nam commented on YARN-9386: [~billie.rinaldi], I've attached a new patch including your suggestion. Thanks > destroying yarn-service is allowed even though running state > > > Key: YARN-9386 > URL: https://issues.apache.org/jira/browse/YARN-9386 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9386.001.patch, YARN-9386.002.patch, > YARN-9386.003.patch > > > It looks very dangerous to destroy a running app. It should not be allowed. > {code} > [yarn-ats@test ~]$ yarn app -list > 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > Total number of applications (application-types: [], states: [SUBMITTED, > ACCEPTED, RUNNING] and tags: []):3 > Application-Id Application-NameApplication-Type > User Queue State Final-State >ProgressTracking-URL > application_1551250841677_0003fbyarn-service >ambari-qa default RUNNING UNDEFINED >100% N/A > application_1552379723611_0002 fb1yarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > application_1550801435420_0001 ats-hbaseyarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > [yarn-ats@test ~]$ yarn app -destroy fb1 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO util.log: 
Logging initialized @1637ms > 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed > service fb1 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9386) destroying yarn-service is allowed even though running state
[ https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9386: --- Attachment: YARN-9386.003.patch > destroying yarn-service is allowed even though running state > > > Key: YARN-9386 > URL: https://issues.apache.org/jira/browse/YARN-9386 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9386.001.patch, YARN-9386.002.patch, > YARN-9386.003.patch > > > It looks very dangerous to destroy a running app. It should not be allowed. > {code} > [yarn-ats@test ~]$ yarn app -list > 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > Total number of applications (application-types: [], states: [SUBMITTED, > ACCEPTED, RUNNING] and tags: []):3 > Application-Id Application-NameApplication-Type > User Queue State Final-State >ProgressTracking-URL > application_1551250841677_0003fbyarn-service >ambari-qa default RUNNING UNDEFINED >100% N/A > application_1552379723611_0002 fb1yarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > application_1550801435420_0001 ats-hbaseyarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > [yarn-ats@test ~]$ yarn app -destroy fb1 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms > 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully 
destroyed > service fb1 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even though running state
[ https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852719#comment-16852719 ] kyungwan nam commented on YARN-9386: Thank you for your comment, [~billie.rinaldi]! I agree with you and will upload a new patch shortly. [~wangda] Yes, as you said, only the owner or an admin can perform operations like start/stop/destroy; this is not about granular permissions. A stopped service can be restarted with its existing configuration whenever we want. Unlike stop, destroy is irreversible: once destroy is requested, the service is deleted permanently, and a running service destroyed by mistake cannot be recovered. That is the danger I have in mind, so destroy should be allowed only for stopped services. > destroying yarn-service is allowed even though running state > > > Key: YARN-9386 > URL: https://issues.apache.org/jira/browse/YARN-9386 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9386.001.patch, YARN-9386.002.patch > > > It looks very dangerous to destroy a running app. It should not be allowed. 
> {code} > [yarn-ats@test ~]$ yarn app -list > 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > Total number of applications (application-types: [], states: [SUBMITTED, > ACCEPTED, RUNNING] and tags: []):3 > Application-Id Application-NameApplication-Type > User Queue State Final-State >ProgressTracking-URL > application_1551250841677_0003fbyarn-service >ambari-qa default RUNNING UNDEFINED >100% N/A > application_1552379723611_0002 fb1yarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > application_1550801435420_0001 ats-hbaseyarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > [yarn-ats@test ~]$ yarn app -destroy fb1 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms > 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed > service fb1 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
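The stop-before-destroy rule argued for in the comment above can be sketched as a simple state guard. This is a hypothetical illustration of the idea, not the actual YARN-9386 patch; the class and enum below are invented:

```java
// Hypothetical sketch of the guard the comment argues for: destroy is
// irreversible, so it is rejected unless the service is already stopped.
public class DestroyGuard {
    enum State { ACCEPTED, RUNNING, STOPPED }

    // Returns an error message, or null when destroy may proceed.
    static String checkDestroy(State state) {
        if (state != State.STOPPED) {
            return "Service is in " + state + " state; stop it before destroy";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(checkDestroy(State.RUNNING));
        System.out.println(checkDestroy(State.STOPPED));
    }
}
```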
[jira] [Commented] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845571#comment-16845571 ] kyungwan nam commented on YARN-9521: Please let me know if anyone has any ideas on how to resolve. Thanks. > RM filed to start due to system services > > > Key: YARN-9521 > URL: https://issues.apache.org/jira/browse/YARN-9521 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Priority: Major > Attachments: YARN-9521.001.patch > > > when starting RM, listing system services directory has failed as follows. > {code} > 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory > is configured to /services > 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation > initialized to yarn (auth:SIMPLE) > 2019-04-30 17:18:25,467 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in > state STARTED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > Filesystem closed > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265) 
> at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501) > Caused by: java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > ... 
13 more > {code} > it looks like due to the usage of filesystem cache. > this issue does not happen, when I add "fs.hdfs.impl.disable.cache=true" to > yarn-site -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840096#comment-16840096 ] kyungwan nam commented on YARN-9521: I think the cause of this problem is as follows. 1. _fs_ is set by calling FileSystem.get() in SystemServiceManagerImpl.serviceInit. 2. RMAppImpl.appAdminClientCleanUp is called in RMAppImpl.FinalTransition if an APP_COMPLETED event occurs during RMStateStore recovery. {code} static void appAdminClientCleanUp(RMAppImpl app) { try { AppAdminClient client = AppAdminClient.createAppAdminClient(app .applicationType, app.conf); int result = client.actionCleanUp(app.name, app.user); {code} ApiServiceClient.actionCleanUp {code} @Override public int actionCleanUp(String appName, String userName) throws IOException, YarnException { ServiceClient sc = new ServiceClient(); sc.init(getConfig()); sc.start(); int result = sc.actionCleanUp(appName, userName); sc.close(); return result; } {code} The ServiceClient instance obtains a FileSystem by calling FileSystem.get() at initialization time, but it might be a cached one. That cached FileSystem is then closed by _sc.close()_. 3. scanForUserServices is called in SystemServiceManagerImpl.serviceStart, but _fs_ has already been closed. RM log {code} // 1. 
SystemServiceManagerImpl.serviceInit // 2019-05-15 10:27:59,445 DEBUG service.AbstractService (AbstractService.java:enterState(443)) - Service: org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl entered state INITED 2019-05-15 10:27:59,446 INFO client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is configured to /services 2019-05-15 10:27:59,472 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3209)) - Loading filesystems 2019-05-15 10:27:59,483 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - file:// = class org.apache.hadoop.fs.LocalFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar 2019-05-15 10:27:59,488 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - viewfs:// = class org.apache.hadoop.fs.viewfs.ViewFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar 2019-05-15 10:27:59,491 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - har:// = class org.apache.hadoop.fs.HarFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar 2019-05-15 10:27:59,492 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - http:// = class org.apache.hadoop.fs.http.HttpFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar 2019-05-15 10:27:59,493 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - https:// = class org.apache.hadoop.fs.http.HttpsFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar 2019-05-15 10:27:59,503 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - hdfs:// = class org.apache.hadoop.hdfs.DistributedFileSystem from /usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar 2019-05-15 10:27:59,511 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - webhdfs:// = class org.apache.hadoop.hdfs.web.WebHdfsFileSystem from 
/usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar 2019-05-15 10:27:59,512 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - swebhdfs:// = class org.apache.hadoop.hdfs.web.SWebHdfsFileSystem from /usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar 2019-05-15 10:27:59,514 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - s3n:// = class org.apache.hadoop.fs.s3native.NativeS3FileSystem from /usr/hdp/3.1.0.0-78/hadoop-mapreduce/hadoop-aws-3.1.1.3.1.2.3.1.0.0-78.jar 2019-05-15 10:27:59,514 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3264)) - Looking for FS supporting hdfs 2019-05-15 10:27:59,514 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3268)) - looking for configuration option fs.hdfs.impl 2019-05-15 10:27:59,528 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3275)) - Looking in service filesystems for implementation class 2019-05-15 10:27:59,528 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3284)) - FS for hdfs is class org.apache.hadoop.hdfs.DistributedFileSystem // 2. APP_COMPLETED event occurs // 2019-05-15 10:28:02,931 DEBUG rmapp.RMAppImpl (RMAppImpl.java:handle(895)) - Processing event for application_1556612756829_0001 of type RECOVER 2019-05-15 10:28:02,931 DEBUG rmapp.RMAppImpl (RMAppImpl.java:recover(933)) - Recovering app: application_1556612756829_0001 with 2 attempts and final state = FAILED 2019-05-15 10:28:02,931 DEBUG attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:(544)) - yarn.app.attempt.diagnostics.limit.
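The failure mode described in this comment (a shared, cached FileSystem closed by one of its holders) can be illustrated with a minimal, self-contained sketch. The class and method names below are invented for illustration and are not Hadoop APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the FileSystem.get() cache hazard: two callers receive
// the SAME cached instance for a given URI, so one caller's close()
// invalidates the handle the other caller is still holding.
public class CachedFsSketch {
    static class Fs {
        private boolean open = true;
        void close() { open = false; }
        void list() {
            if (!open) throw new IllegalStateException("Filesystem closed");
        }
    }

    private static final Map<String, Fs> CACHE = new HashMap<>();

    // Analogous to FileSystem.get(conf): returns a shared, cached instance.
    static Fs get(String uri) {
        return CACHE.computeIfAbsent(uri, k -> new Fs());
    }

    public static void main(String[] args) {
        Fs managerFs = get("hdfs://ns1"); // SystemServiceManagerImpl's handle
        Fs clientFs = get("hdfs://ns1");  // ServiceClient's handle (same object)
        clientFs.close();                 // sc.close() during app cleanup
        try {
            managerFs.list();             // later scanForUserServices() call
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Filesystem closed"
        }
    }
}
```

This mirrors why disabling the cache (or giving each component its own FileSystem instance) avoids the "Filesystem closed" error.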
[jira] [Updated] (YARN-9521) RM failed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9521: --- Attachment: YARN-9521.001.patch > RM filed to start due to system services > > > Key: YARN-9521 > URL: https://issues.apache.org/jira/browse/YARN-9521 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Priority: Major > Attachments: YARN-9521.001.patch > > > when starting RM, listing system services directory has failed as follows. > {code} > 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory > is configured to /services > 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation > initialized to yarn (auth:SIMPLE) > 2019-04-30 17:18:25,467 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in > state STARTED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > Filesystem closed > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265) > at java.security.AccessController.doPrivileged(Native Method) > at 
javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501) > Caused by: java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > ... 13 more > {code} > it looks like due to the usage of filesystem cache. 
> this issue does not happen, when I add "fs.hdfs.impl.disable.cache=true" to > yarn-site -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9521) RM failed to start due to system services
kyungwan nam created YARN-9521: -- Summary: RM filed to start due to system services Key: YARN-9521 URL: https://issues.apache.org/jira/browse/YARN-9521 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.2 Reporter: kyungwan nam when starting RM, listing system services directory has failed as follows. {code} 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is configured to /services 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation initialized to yarn (auth:SIMPLE) 2019-04-30 17:18:25,467 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in state STARTED org.apache.hadoop.service.ServiceStateException: java.io.IOException: Filesystem closed at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639) at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1217) at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1233) at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1200) at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179) at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187) at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282) at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) ... 13 more {code} It looks like this is caused by the use of the FileSystem cache. This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to yarn-site.xml. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
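The workaround mentioned in the report can be sketched as a configuration fragment. Note this is a process-wide setting: it disables the shared HDFS FileSystem cache entirely, so a `FileSystem.close()` in one component can no longer invalidate the instance used by `SystemServiceManagerImpl`, at the cost of extra NameNode connections.

```xml
<!-- yarn-site.xml (workaround sketch): disable the cached hdfs:// FileSystem
     instance so each caller gets its own, avoiding "Filesystem closed" when
     another component closes the shared cached instance. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```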
[jira] [Commented] (YARN-9307) node_partitions constraint does not work
[ https://issues.apache.org/jira/browse/YARN-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1682#comment-1682 ] kyungwan nam commented on YARN-9307: Thank you, [~cheersyang]! > node_partitions constraint does not work > > > Key: YARN-9307 > URL: https://issues.apache.org/jira/browse/YARN-9307 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Fix For: 3.1.3 > > Attachments: YARN-9307.branch-3.1.001.patch > > > When a yarn-service app is submitted with the below configuration, > the node_partitions constraint does not work. > {code} > … > "placement_policy": { >"constraints": [ > { >"type": "ANTI_AFFINITY", >"scope": "NODE", >"target_tags": [ > "ws" >], >"node_partitions": [ > "" >] > } >] > } > {code}
[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even though running state
[ https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798650#comment-16798650 ] kyungwan nam commented on YARN-9386: Attached a new patch, which fixes the test code. > destroying yarn-service is allowed even though running state > > > Key: YARN-9386 > URL: https://issues.apache.org/jira/browse/YARN-9386 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9386.001.patch, YARN-9386.002.patch > > > It looks very dangerous to destroy a running app. It should not be allowed. > {code} > [yarn-ats@test ~]$ yarn app -list > 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > Total number of applications (application-types: [], states: [SUBMITTED, > ACCEPTED, RUNNING] and tags: []):3 > Application-Id Application-NameApplication-Type > User Queue State Final-State >ProgressTracking-URL > application_1551250841677_0003fbyarn-service >ambari-qa default RUNNING UNDEFINED >100% N/A > application_1552379723611_0002 fb1yarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > application_1550801435420_0001 ats-hbaseyarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > [yarn-ats@test ~]$ yarn app -destroy fb1 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms > 19/03/12 17:49:07 INFO 
client.ApiServiceClient: Successfully destroyed > service fb1 > {code}
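A minimal sketch of the guard this issue calls for, assuming a state check before destroy is the intended fix. The class and enum names below are illustrative only; they are not the actual Hadoop `ServiceClient` API.

```java
// Hypothetical guard: a service must be fully stopped before it may be
// destroyed, so a RUNNING service like "fb1" in the transcript is rejected.
public class DestroyGuard {

    // Simplified lifecycle states, mirroring the application states
    // shown in the "yarn app -list" output above.
    public enum State { SUBMITTED, ACCEPTED, RUNNING, STOPPED }

    /** Destroy is only safe once the service has fully stopped. */
    public static boolean canDestroy(State state) {
        return state == State.STOPPED;
    }

    public static void main(String[] args) {
        System.out.println(canDestroy(State.RUNNING));   // prints "false"
        System.out.println(canDestroy(State.STOPPED));   // prints "true"
    }
}
```

With such a guard in place, the expected workflow would be to stop the service first (`yarn app -stop fb1`) and only then destroy it (`yarn app -destroy fb1`).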
[jira] [Updated] (YARN-9386) destroying yarn-service is allowed even though running state
[ https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9386: --- Attachment: YARN-9386.002.patch > destroying yarn-service is allowed even though running state > > > Key: YARN-9386 > URL: https://issues.apache.org/jira/browse/YARN-9386 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9386.001.patch, YARN-9386.002.patch > > > It looks very dangerous to destroy a running app. It should not be allowed. > {code} > [yarn-ats@test ~]$ yarn app -list > 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > Total number of applications (application-types: [], states: [SUBMITTED, > ACCEPTED, RUNNING] and tags: []):3 > Application-Id Application-NameApplication-Type > User Queue State Final-State >ProgressTracking-URL > application_1551250841677_0003fbyarn-service >ambari-qa default RUNNING UNDEFINED >100% N/A > application_1552379723611_0002 fb1yarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > application_1550801435420_0001 ats-hbaseyarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > [yarn-ats@test ~]$ yarn app -destroy fb1 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms > 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed > service fb1 > 
{code}
[jira] [Updated] (YARN-9386) destroying yarn-service is allowed even though running state
[ https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-9386: --- Attachment: YARN-9386.001.patch > destroying yarn-service is allowed even though running state > > > Key: YARN-9386 > URL: https://issues.apache.org/jira/browse/YARN-9386 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Priority: Major > Attachments: YARN-9386.001.patch > > > It looks very dangerous to destroy a running app. It should not be allowed. > {code} > [yarn-ats@test ~]$ yarn app -list > 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > Total number of applications (application-types: [], states: [SUBMITTED, > ACCEPTED, RUNNING] and tags: []):3 > Application-Id Application-NameApplication-Type > User Queue State Final-State >ProgressTracking-URL > application_1551250841677_0003fbyarn-service >ambari-qa default RUNNING UNDEFINED >100% N/A > application_1552379723611_0002 fb1yarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > application_1550801435420_0001 ats-hbaseyarn-service > yarn-ats default RUNNING UNDEFINED >100% N/A > [yarn-ats@test ~]$ yarn app -destroy fb1 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms > 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed > service fb1 > {code} -- This message was sent by Atlassian 
JIRA (v7.6.3#76005)