[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469020#comment-16469020 ]
Gour Saha commented on YARN-8243:
---------------------------------

bq. The compareTo method is checking the container start time, when it should only be checking the component instance ID.

[~billie.rinaldi], I agree with this. We should make compareTo explicitly check the instance ID. Fortunately, given the way component instance objects are created and allocated containers are then assigned to them, it looks like the start time order is the same as the instance ID order.

bq. I do not think we should remove pending instances as is proposed in this patch because it will cause "holes" in the component ID list. If we have comp-0, comp-1, comp-2 and comp-1 is pending, when we flex down comp-1 would be removed and we would be left with comp-0 and comp-2. If we flexed up, we would then have comp-0, comp-2, comp-3. I think we should always remove the instance with the highest ID.

Once we make the compareTo change as suggested above, this situation will not occur. Nevertheless, even if we could not fix the holes issue programmatically, we should not remove a running instance when there is a pending instance. Let's say a service owner flexes up her service by 2 extra containers. The containers don't get allocated for a long time (because there are no resources left in the cluster, or for some other reason). Meanwhile, the service owner decides to go back down to the original number of instances because she does not need those 2 additional containers. At this point, she expects no change to her service, but instead sees 2 running containers killed. That would not be a desirable experience.
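To make the two points above concrete, here is a minimal sketch of the selection order being argued for: compareTo orders strictly by instance ID, and flex down picks pending instances before running ones, breaking ties by highest ID. The class FlexDownSketch, its nested Instance type, the hasContainer flag and selectForRemoval are all hypothetical names invented for illustration; this is not the actual yarn-native-services code or the attached patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; hypothetical types, not the YARN service classes. */
final class FlexDownSketch {

    /** Hypothetical stand-in for a component instance. */
    static final class Instance implements Comparable<Instance> {
        final String componentName;
        final int id;               // numeric suffix, e.g. 4 for "ping-4"
        final boolean hasContainer; // false while the container request is still pending

        Instance(String componentName, int id, boolean hasContainer) {
            this.componentName = componentName;
            this.id = id;
            this.hasContainer = hasContainer;
        }

        String name() {
            return componentName + "-" + id;
        }

        // Order strictly by instance ID, never by container start time.
        @Override
        public int compareTo(Instance other) {
            return Integer.compare(id, other.id);
        }
    }

    /** Picks instances to remove on flex down: pending ones first, then highest ID first. */
    static List<Instance> selectForRemoval(List<Instance> instances, int count) {
        List<Instance> candidates = new ArrayList<>(instances);
        // Boolean.FALSE sorts before Boolean.TRUE, so pending instances come first;
        // within each group the reversed natural order puts the highest ID first.
        candidates.sort(
            Comparator.comparing((Instance i) -> i.hasContainer)
                      .thenComparing(Comparator.<Instance>reverseOrder()));
        int n = Math.max(0, Math.min(count, candidates.size()));
        return new ArrayList<>(candidates.subList(0, n));
    }

    public static void main(String[] args) {
        // Mirrors the scenario in the issue: ping-0..ping-4 running, ping-5 pending.
        List<Instance> ping = new ArrayList<>();
        for (int id = 0; id < 5; id++) {
            ping.add(new Instance("ping", id, true));
        }
        ping.add(new Instance("ping", 5, false));
        for (Instance i : selectForRemoval(ping, 1)) {
            System.out.println(i.name());
        }
    }
}
```

In the flex-up scenario described above the pending instances are also the most recently created ones, so under this ordering they carry the highest IDs and removing them first leaves no holes. If a pending instance could ever have a lower ID than a running one, the two preferences would conflict and this sketch would favor keeping running containers alive.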
> Flex down should first remove pending container requests (if any) and then
> kill running containers
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8243
>                 URL: https://issues.apache.org/jira/browse/YARN-8243
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn-native-services
>    Affects Versions: 3.1.0
>            Reporter: Gour Saha
>            Assignee: Gour Saha
>            Priority: Major
>         Attachments: YARN-8243.01.patch
>
>
> This is easy to test on a service with anti-affinity component, to simulate
> pending container requests. It can be simulated by other means also (no
> resource left in cluster, etc.).
> Service yarnfile used to test this -
> {code:java}
> {
>   "name": "sleeper-service",
>   "version": "1",
>   "components" :
>   [
>     {
>       "name": "ping",
>       "number_of_containers": 2,
>       "resource": {
>         "cpus": 1,
>         "memory": "256"
>       },
>       "launch_command": "sleep 9000",
>       "placement_policy": {
>         "constraints": [
>           {
>             "type": "ANTI_AFFINITY",
>             "scope": "NODE",
>             "target_tags": [
>               "ping"
>             ]
>           }
>         ]
>       }
>     }
>   ]
> }
> {code}
> Launch a service with the above yarnfile as below -
> {code:java}
> yarn app -launch simple-aa-1 simple_AA.json
> {code}
> Let's assume there are only 5 nodes in this cluster. Now, flex the above
> service to 1 extra container than the number of nodes (6 in my case).
> {code:java}
> yarn app -flex simple-aa-1 -component ping 6
> {code}
> Only 5 containers will be allocated and running for simple-aa-1.
> At this point, flex it down to 5 containers -
> {code:java}
> yarn app -flex simple-aa-1 -component ping 5
> {code}
> This is what is seen in the serviceam log at this point -
> {noformat}
> 2018-05-03 20:17:38,469 [IPC Server handler 0 on 38124] INFO service.ClientAMService - Flexing component ping to 5
> 2018-05-03 20:17:38,469 [Component dispatcher] INFO component.Component - [FLEX DOWN COMPONENT ping]: scaling down from 6 to 5
> 2018-05-03 20:17:38,470 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Flexed down by user, destroying.
> 2018-05-03 20:17:38,473 [Component dispatcher] INFO component.Component - [COMPONENT ping] Transitioned from FLEXING to STABLE on FLEX event.
> 2018-05-03 20:17:38,474 [pool-5-thread-8] INFO registry.YarnRegistryViewForProviders - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleting registry path /users/root/services/yarn-service/simple-aa-1/components/ctr-1525297086734-0013-01-000006
> 2018-05-03 20:17:38,476 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>     at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>     at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,480 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>     at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>     at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,578 [pool-5-thread-8] INFO instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleted component instance dir: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-4
> 2018-05-03 20:17:39,268 [AMRM Callback Handler Thread] WARN service.ServiceScheduler - Container container_1525297086734_0013_01_000006 Completed. No component instance exists. exitStatus=-100. diagnostics=Container released by application
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO service.ServiceScheduler - 1 containers allocated.
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO service.ServiceScheduler - [COMPONENT ping]: remove 0 outstanding container requests for allocateId 0
> 2018-05-03 20:17:40,274 [Component dispatcher] INFO component.Component - [COMPONENT ping]: container_1525297086734_0013_01_000007 allocated, num pending component instances reduced to 0
> 2018-05-03 20:17:40,274 [Component dispatcher] INFO component.Component - [COMPONENT ping]: Assigned container_1525297086734_0013_01_000007 to component instance ping-5 and launch on host ctr-e138-1518143905142-280820-01-000008.example.site:25454
> 2018-05-03 20:17:40,277 [pool-6-thread-6] INFO provider.ProviderUtils - [COMPINSTANCE ping-5 : container_1525297086734_0013_01_000007]: Creating dir on hdfs: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-5
> 2018-05-03 20:17:40,316 [pool-6-thread-6] INFO containerlaunch.ContainerLaunchService - launching container container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,318 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #5] INFO impl.NMClientAsyncImpl - Processing Event EventType: START_CONTAINER for Container container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,338 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CONTAINER_STARTED at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_STARTED at STABLE
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>     at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>     at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Status response shows that only 4 containers are running and the service is
> not in STABLE state -
> {code:java}
> yarn app -status simple-aa-1
> {code}
> output -
> {code:java}
> {
>   "components": [
>     {
>       "configuration": {
>         "env": {},
>         "files": [],
>         "properties": {}
>       },
>       "containers": [
>         {
>           "bare_host": "ctr-e138-1518143905142-280820-01-000007.example.site",
>           "component_instance_name": "ping-1",
>           "hostname": "ctr-e138-1518143905142-280820-01-000007.example.site",
>           "id": "container_1525297086734_0013_01_000003",
>           "ip": "x.x.x.x",
>           "launch_time": 1525378141535,
>           "state": "READY"
>         },
>         {
>           "bare_host": "ctr-e138-1518143905142-280820-01-000006.example.site",
>           "component_instance_name": "ping-0",
>           "hostname": "ctr-e138-1518143905142-280820-01-000006.example.site",
>           "id": "container_1525297086734_0013_01_000002",
>           "ip": "x.x.x.x",
>           "launch_time": 1525378141513,
>           "state": "READY"
>         },
>         {
>           "bare_host": "ctr-e138-1518143905142-280820-01-000005.example.site",
>           "component_instance_name": "ping-3",
>           "hostname": "ctr-e138-1518143905142-280820-01-000005.example.site",
>           "id": "container_1525297086734_0013_01_000005",
>           "ip": "x.x.x.x",
>           "launch_time": 1525378303429,
>           "state": "READY"
>         },
>         {
>           "bare_host": "ctr-e138-1518143905142-280820-01-000004.example.site",
>           "component_instance_name": "ping-2",
>           "hostname": "ctr-e138-1518143905142-280820-01-000004.example.site",
>           "id": "container_1525297086734_0013_01_000004",
>           "ip": "x.x.x.x",
>           "launch_time": 1525378303425,
>           "state": "READY"
>         }
>       ],
>       "dependencies": [],
>       "launch_command": "sleep 9000",
>       "name": "ping",
>       "number_of_containers": 5,
>       "placement_policy": {
>         "constraints": [
>           {
>             "node_attributes": {},
>             "node_partitions": [],
>             "scope": "NODE",
>             "target_tags": [
>               "ping"
>             ],
>             "type": "ANTI_AFFINITY"
>           }
>         ]
>       },
>       "quicklinks": [],
>       "resource": {
>         "additional": {},
>         "cpus": 1,
>         "memory": "256"
>       },
>       "run_privileged_container": false,
>       "state": "FLEXING"
>     }
>   ],
>   "configuration": {
>     "env": {},
>     "files": [],
>     "properties": {}
>   },
>   "id": "application_1525297086734_0013",
>   "kerberos_principal": {},
>   "lifetime": -1,
>   "name": "simple-aa-1",
>   "quicklinks": {},
>   "state": "STARTED",
>   "version": "1"
> }
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org