[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467675#comment-16467675 ]
Billie Rinaldi commented on YARN-8243:
--------------------------------------

bq. Isn't that what we want?

Yes, I'm saying that removing the highest ID first is what we want, but that is not what the code is doing now. The compareTo method is checking the container start time, when it should only be checking the component instance ID. The way it is coded now, we have a bug (an additional bug beyond the one reported here). If the container start time order is comp-0, comp-2, comp-1, and we flex down by one, comp-1 would get removed and the instanceIdCounter would be decremented from 3 to 2. If we then flexed up by one, we would end up with comp-0, comp-2, comp-2. We need to fix this; it's a small change in compareTo. I am saying this fix would also address the specific issue you were seeing, but it would not address all possible cases of running instances being removed before pending instances.

I do not think we should remove pending instances as is proposed in this patch, because it will cause "holes" in the component ID list. If we have comp-0, comp-1, comp-2 and comp-1 is pending, then when we flex down, comp-1 would be removed and we would be left with comp-0 and comp-2. If we flexed up, we would then have comp-0, comp-2, comp-3. I think we should always remove the instance with the highest ID.

> Flex down should first remove pending container requests (if any) and then
> kill running containers
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8243
>                 URL: https://issues.apache.org/jira/browse/YARN-8243
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn-native-services
>    Affects Versions: 3.1.0
>            Reporter: Gour Saha
>            Assignee: Gour Saha
>            Priority: Major
>         Attachments: YARN-8243.01.patch
>
>
> This is easy to test on a service with an anti-affinity component, to simulate
> pending container requests. It can be simulated by other means also (no
> resource left in cluster, etc.).
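The compareTo change discussed above — ordering instances by component instance ID only, rather than by container start time, so that flex-down always removes the highest-ID instance — could be sketched roughly like this. The `Instance` class and its fields are illustrative stand-ins, not the actual ComponentInstance code:

```java
import java.util.TreeSet;

public class InstanceOrdering {
    // Hypothetical stand-in for a component instance; field names are
    // illustrative, not the real ComponentInstance implementation.
    static class Instance implements Comparable<Instance> {
        final int instanceId;          // e.g. ping-4 -> 4
        final long containerStartTime;

        Instance(int instanceId, long containerStartTime) {
            this.instanceId = instanceId;
            this.containerStartTime = containerStartTime;
        }

        // Proposed ordering: compare by instance ID only. Comparing by
        // container start time (the buggy behavior described above) could
        // remove comp-1 instead of comp-2 when the start-time order is
        // comp-0, comp-2, comp-1.
        @Override
        public int compareTo(Instance other) {
            return Integer.compare(this.instanceId, other.instanceId);
        }
    }

    public static void main(String[] args) {
        TreeSet<Instance> instances = new TreeSet<>();
        // Container start-time order is comp-0, comp-2, comp-1,
        // i.e. out of instance-ID order.
        instances.add(new Instance(0, 100L));
        instances.add(new Instance(2, 200L));
        instances.add(new Instance(1, 300L));

        // Flex down by one: with ID-based ordering, the last element of
        // the sorted set is always the highest-ID instance.
        Instance removed = instances.pollLast();
        System.out.println("removed comp-" + removed.instanceId);
        // prints: removed comp-2
    }
}
```

With this ordering, flexing down never leaves holes in the ID list, and the instanceIdCounter can safely be decremented.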
> Service yarnfile used to test this -
> {code:java}
> {
>   "name": "sleeper-service",
>   "version": "1",
>   "components" :
>     [
>       {
>         "name": "ping",
>         "number_of_containers": 2,
>         "resource": {
>           "cpus": 1,
>           "memory": "256"
>         },
>         "launch_command": "sleep 9000",
>         "placement_policy": {
>           "constraints": [
>             {
>               "type": "ANTI_AFFINITY",
>               "scope": "NODE",
>               "target_tags": [
>                 "ping"
>               ]
>             }
>           ]
>         }
>       }
>     ]
> }
> {code}
> Launch a service with the above yarnfile as below -
> {code:java}
> yarn app -launch simple-aa-1 simple_AA.json
> {code}
> Let's assume there are only 5 nodes in this cluster. Now, flex the above
> service to 1 extra container than the number of nodes (6 in my case).
> {code:java}
> yarn app -flex simple-aa-1 -component ping 6
> {code}
> Only 5 containers will be allocated and running for simple-aa-1. At this
> point, flex it down to 5 containers -
> {code:java}
> yarn app -flex simple-aa-1 -component ping 5
> {code}
> This is what is seen in the serviceam log at this point -
> {noformat}
> 2018-05-03 20:17:38,469 [IPC Server handler 0 on 38124] INFO service.ClientAMService - Flexing component ping to 5
> 2018-05-03 20:17:38,469 [Component dispatcher] INFO component.Component - [FLEX DOWN COMPONENT ping]: scaling down from 6 to 5
> 2018-05-03 20:17:38,470 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Flexed down by user, destroying.
> 2018-05-03 20:17:38,473 [Component dispatcher] INFO component.Component - [COMPONENT ping] Transitioned from FLEXING to STABLE on FLEX event.
> 2018-05-03 20:17:38,474 [pool-5-thread-8] INFO registry.YarnRegistryViewForProviders - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleting registry path /users/root/services/yarn-service/simple-aa-1/components/ctr-1525297086734-0013-01-000006
> 2018-05-03 20:17:38,476 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> 	at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> 	at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,480 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> 	at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> 	at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,578 [pool-5-thread-8] INFO instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleted component instance dir: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-4
> 2018-05-03 20:17:39,268 [AMRM Callback Handler Thread] WARN service.ServiceScheduler - Container container_1525297086734_0013_01_000006 Completed. No component instance exists. exitStatus=-100. diagnostics=Container released by application
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO service.ServiceScheduler - 1 containers allocated.
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO service.ServiceScheduler - [COMPONENT ping]: remove 0 outstanding container requests for allocateId 0
> 2018-05-03 20:17:40,274 [Component dispatcher] INFO component.Component - [COMPONENT ping]: container_1525297086734_0013_01_000007 allocated, num pending component instances reduced to 0
> 2018-05-03 20:17:40,274 [Component dispatcher] INFO component.Component - [COMPONENT ping]: Assigned container_1525297086734_0013_01_000007 to component instance ping-5 and launch on host ctr-e138-1518143905142-280820-01-000008.example.site:25454
> 2018-05-03 20:17:40,277 [pool-6-thread-6] INFO provider.ProviderUtils - [COMPINSTANCE ping-5 : container_1525297086734_0013_01_000007]: Creating dir on hdfs: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-5
> 2018-05-03 20:17:40,316 [pool-6-thread-6] INFO containerlaunch.ContainerLaunchService - launching container container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,318 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #5] INFO impl.NMClientAsyncImpl - Processing Event EventType: START_CONTAINER for Container container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,338 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CONTAINER_STARTED at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_STARTED at STABLE
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> 	at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Status response shows that only 4 containers are running and the service is
> not in STABLE state -
> {code:java}
> yarn app -status simple-aa-1
> {code}
> output -
> {code:java}
> {
>   "components": [
>     {
>       "configuration": {
>         "env": {},
>         "files": [],
>         "properties": {}
>       },
>       "containers": [
>         {
>           "bare_host": "ctr-e138-1518143905142-280820-01-000007.example.site",
>           "component_instance_name": "ping-1",
>           "hostname": "ctr-e138-1518143905142-280820-01-000007.example.site",
>           "id": "container_1525297086734_0013_01_000003",
>           "ip": "x.x.x.x",
>           "launch_time": 1525378141535,
>           "state": "READY"
>         },
>         {
>           "bare_host": "ctr-e138-1518143905142-280820-01-000006.example.site",
>           "component_instance_name": "ping-0",
>           "hostname": "ctr-e138-1518143905142-280820-01-000006.example.site",
>           "id": "container_1525297086734_0013_01_000002",
>           "ip": "x.x.x.x",
>           "launch_time": 1525378141513,
>           "state": "READY"
>         },
>         {
>           "bare_host": "ctr-e138-1518143905142-280820-01-000005.example.site",
>           "component_instance_name": "ping-3",
>           "hostname": "ctr-e138-1518143905142-280820-01-000005.example.site",
>           "id": "container_1525297086734_0013_01_000005",
>           "ip": "x.x.x.x",
>           "launch_time": 1525378303429,
>           "state": "READY"
>         },
>         {
>           "bare_host": "ctr-e138-1518143905142-280820-01-000004.example.site",
>           "component_instance_name": "ping-2",
>           "hostname": "ctr-e138-1518143905142-280820-01-000004.example.site",
>           "id": "container_1525297086734_0013_01_000004",
>           "ip": "x.x.x.x",
>           "launch_time": 1525378303425,
>           "state": "READY"
>         }
>       ],
>       "dependencies": [],
>       "launch_command": "sleep 9000",
>       "name": "ping",
>       "number_of_containers": 5,
>       "placement_policy": {
>         "constraints": [
>           {
>             "node_attributes": {},
>             "node_partitions": [],
>             "scope": "NODE",
>             "target_tags": [
>               "ping"
>             ],
>             "type": "ANTI_AFFINITY"
>           }
>         ]
>       },
>       "quicklinks": [],
>       "resource": {
>         "additional": {},
>         "cpus": 1,
>         "memory": "256"
>       },
>       "run_privileged_container": false,
>       "state": "FLEXING"
>     }
>   ],
>   "configuration": {
>     "env": {},
>     "files": [],
>     "properties": {}
>   },
>   "id": "application_1525297086734_0013",
>   "kerberos_principal": {},
>   "lifetime": -1,
>   "name": "simple-aa-1",
>   "quicklinks": {},
>   "state": "STARTED",
>   "version": "1"
> }
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org