kyungwan nam created YARN-9691:
----------------------------------

Summary: canceling upgrade does not work if upgrade failed container is existing
Key: YARN-9691
URL: https://issues.apache.org/jira/browse/YARN-9691
Project: Hadoop YARN
Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam

If a container fails to upgrade during a YARN service upgrade, the container is released and transitions to the FAILED_UPGRADE state.
After that, I expected to be able to go back to the previous version using cancel-upgrade, but it did not work.

At that time, the AM log was as follows:
{code}
# failed to upgrade container_e62_1563179597798_0006_01_000008
2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO service.ClientAMService - Upgrade container container_e62_1563179597798_0006_01_000008
2019-07-16 18:21:55,153 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from NEEDS_UPGRADE -> UPGRADING
2019-07-16 18:21:55,154 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] Transitioned from READY to UPGRADING on UPGRADE event
2019-07-16 18:21:55,154 [pool-5-thread-4] INFO registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008]: Deleting registry path /users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-000008
2019-07-16 18:21:55,156 [pool-6-thread-6] INFO provider.ProviderUtils - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] version 1.0.1 : Creating dir on hdfs: hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO containerlaunch.ContainerLaunchService - reInitializing container container_e62_1563179597798_0006_01_000008 with version 1.0.1
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO containerlaunch.AbstractLauncher - yarn docker env var has been set {LANGUAGE=en_US.UTF-8, HADOOP_USER_NAME=test, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM, WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker,
YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=<LOG_DIR>}
2019-07-16 18:21:55,158 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER for Container container_e62_1563179597798_0006_01_000008
2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from UPGRADING -> RUNNING_BUT_UNREADY
2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] retrieve status after 30
2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] Transitioned from UPGRADING to REINITIALIZED on START event
2019-07-16 18:22:07,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:22:37,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:07,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:08,225 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from RUNNING_BUT_UNREADY -> FAILED_UPGRADE

# request canceling upgrade
2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_000004 true
2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_000003 true
2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_000008 true
2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - [SERVICE] spec state changed from UPGRADING -> CANCEL_UPGRADING
2019-07-16 18:28:22,713 [Component dispatcher] INFO component.Component - [COMPONENT sleep]: need upgrade to 1.0.0
2019-07-16 18:28:22,713 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from FAILED_UPGRADE -> NEEDS_UPGRADE
2019-07-16 18:28:22,713 [Component dispatcher] INFO component.Component - [COMPONENT sleep] Transitioned from UPGRADING to CANCEL_UPGRADING on CANCEL_UPGRADE event.
2019-07-16 18:28:22,713 [Component dispatcher] INFO component.Component - [COMPONENT sleep1]: need upgrade to 1.0.0
2019-07-16 18:28:22,714 [Component dispatcher] INFO component.Component - [COMPONENT sleep1] Transitioned from UPGRADING to CANCEL_UPGRADING on CANCEL_UPGRADE event.
2019-07-16 18:28:22,714 [Component dispatcher] INFO instance.ComponentInstance - container_e62_1563179597798_0006_01_000004 nothing to cancel
2019-07-16 18:28:22,714 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-2 : container_e62_1563179597798_0006_01_000004] spec state state changed from NEEDS_UPGRADE -> READY
2019-07-16 18:28:22,714 [Component dispatcher] INFO instance.ComponentInstance - container_e62_1563179597798_0006_01_000003 nothing to cancel
2019-07-16 18:28:22,714 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-1 : container_e62_1563179597798_0006_01_000003] spec state state changed from NEEDS_UPGRADE -> READY
2019-07-16 18:28:22,714 [Component dispatcher] ERROR service.ServiceScheduler - No component instance exists for container_e62_1563179597798_0006_01_000008
{code}
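
One possible reading of the log above, offered only as an assumption: the cancel-upgrade event for sleep-0 is addressed by container ID, but the failed container has already been released, so the lookup finds nothing and the event is dropped. The toy model below is not the YARN service AM code. The class `CancelUpgradeSketch`, its methods, and the map `liveInstances` are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the suspected failure; every name here is hypothetical. */
class CancelUpgradeSketch {
    // Assumption: instances are looked up by container ID when an event is dispatched.
    private final Map<String, String> liveInstances = new HashMap<>();
    private final List<String> messages = new ArrayList<>();

    /** Records a running instance under its container ID. */
    void register(String containerId, String instanceName) {
        liveInstances.put(containerId, instanceName);
    }

    /** Models the release of a container whose upgrade failed. */
    void releaseAfterFailedUpgrade(String containerId) {
        liveInstances.remove(containerId);
    }

    /** Models dispatching cancel-upgrade by container ID; returns true if delivered. */
    boolean cancelUpgrade(String containerId) {
        String instance = liveInstances.get(containerId);
        if (instance == null) {
            // Mirrors the wording of the ERROR line in the AM log.
            messages.add("ERROR No component instance exists for " + containerId);
            return false;
        }
        messages.add("INFO " + instance + " received CANCEL_UPGRADE");
        return true;
    }

    List<String> messages() {
        return messages;
    }

    public static void main(String[] args) {
        CancelUpgradeSketch am = new CancelUpgradeSketch();
        am.register("container_000003", "sleep-1");
        am.register("container_000008", "sleep-0");

        am.releaseAfterFailedUpgrade("container_000008"); // upgrade of sleep-0 failed

        am.cancelUpgrade("container_000003"); // delivered
        am.cancelUpgrade("container_000008"); // dropped: no entry for this container ID
        am.messages().forEach(System.out::println);
    }
}
```

If this reading is right, the instance that most needs the cancel event is the one that cannot receive it, which matches sleep-0 staying in NEEDS_UPGRADE while sleep-1 and sleep-2 return to READY.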