[ 
https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469020#comment-16469020
 ] 

Gour Saha commented on YARN-8243:
---------------------------------

bq. The compareTo method is checking the container start time, when it should 
only be checking the component instance ID.
[~billie.rinaldi], I agree with this. We should make compareTo explicitly 
compare the instance ID. Fortunately, given the way component instance objects 
are created and then assigned allocated containers, it looks like the start 
time order is currently the same as the instance ID order.
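
To illustrate the ordering discussed here, below is a minimal sketch of a compareTo that looks only at the instance ID. InstanceSketch and its fields are hypothetical stand-ins for this example, not the actual ComponentInstance class, and the real patch may look different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a component instance; not the real YARN class.
final class InstanceSketch implements Comparable<InstanceSketch> {
  final String componentName;
  final long instanceId;          // numeric suffix, e.g. 4 for "ping-4"
  final long containerStartTime;  // kept for illustration, ignored when ordering

  InstanceSketch(String componentName, long instanceId, long containerStartTime) {
    this.componentName = componentName;
    this.instanceId = instanceId;
    this.containerStartTime = containerStartTime;
  }

  String name() {
    return componentName + "-" + instanceId;
  }

  // Order strictly by instance ID, so the result does not depend on when
  // (or whether) a container was started for the instance.
  @Override
  public int compareTo(InstanceSketch other) {
    return Long.compare(instanceId, other.instanceId);
  }

  // Returns instance names in ascending instance ID order.
  static List<String> sortedNames(List<InstanceSketch> instances) {
    List<InstanceSketch> copy = new ArrayList<>(instances);
    Collections.sort(copy);
    List<String> names = new ArrayList<>();
    for (InstanceSketch i : copy) {
      names.add(i.name());
    }
    return names;
  }
}
```

With this ordering, an instance whose container happened to start late still sorts by its ID, so the sort no longer relies on start time order matching ID order.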

bq. I do not think we should remove pending instances as is proposed in this 
patch because it will cause "holes" in the component ID list. If we have 
comp-0, comp-1, comp-2 and comp-1 is pending, when we flex down comp-1 would be 
removed and we would be left with comp-0 and comp-2. If we flexed up, we would 
then have comp-0, comp-2, comp-3. I think we should always remove the instance 
with the highest ID.
Once we make the compareTo change suggested above, this situation will not 
occur. Nevertheless, even if we could not programmatically fix the holes 
issue, we should not remove a running instance when there is a pending 
instance. Say a service owner flexes up her service by 2 extra containers. 
The containers are not allocated for a long time (because no resources are 
left in the cluster, or for some other reason). Meanwhile, the service owner 
decides to go back down to the original number of instances because she no 
longer needs those 2 additional containers. At this point she expects no 
change to her service, but instead sees 2 running containers killed. That is 
not a desirable experience.
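
A rough sketch of the selection order argued for above: on flex down, pick pending instances first, and only then running instances, highest ID first. FlexDownSketch, Instance, and pickForRemoval are hypothetical names for this illustration, not the service AM code or the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of flex-down victim selection.
final class FlexDownSketch {

  // Minimal model of an instance: its ID and whether a container is assigned.
  static final class Instance {
    final long id;
    final boolean hasContainer; // false means the container request is pending

    Instance(long id, boolean hasContainer) {
      this.id = id;
      this.hasContainer = hasContainer;
    }
  }

  // Picks up to 'count' instances to remove: pending ones first, then
  // running ones; within each group the highest ID goes first.
  static List<Instance> pickForRemoval(List<Instance> instances, int count) {
    List<Instance> ordered = new ArrayList<>(instances);
    ordered.sort((a, b) -> {
      if (a.hasContainer != b.hasContainer) {
        return a.hasContainer ? 1 : -1; // pending sorts before running
      }
      return Long.compare(b.id, a.id);  // then highest ID first
    });
    int n = Math.max(0, Math.min(count, ordered.size()));
    return new ArrayList<>(ordered.subList(0, n));
  }
}
```

In the scenario above (original instances running, 2 extra instances pending), asking for 2 removals returns only the 2 pending instances, so no running container would be killed.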


> Flex down should first remove pending container requests (if any) and then 
> kill running containers
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8243
>                 URL: https://issues.apache.org/jira/browse/YARN-8243
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn-native-services
>    Affects Versions: 3.1.0
>            Reporter: Gour Saha
>            Assignee: Gour Saha
>            Priority: Major
>         Attachments: YARN-8243.01.patch
>
>
> This is easy to test on a service with an anti-affinity component, which 
> simulates pending container requests. It can also be simulated by other 
> means (no resources left in the cluster, etc.).
> Service yarnfile used to test this -
> {code:java}
> {
>   "name": "sleeper-service",
>   "version": "1",
>   "components" :
>   [
>     {
>       "name": "ping",
>       "number_of_containers": 2,
>       "resource": {
>         "cpus": 1,
>         "memory": "256"
>       },
>       "launch_command": "sleep 9000",
>       "placement_policy": {
>         "constraints": [
>           {
>             "type": "ANTI_AFFINITY",
>             "scope": "NODE",
>             "target_tags": [
>               "ping"
>             ]
>           }
>         ]
>       }
>     }
>   ]
> }
> {code}
> Launch a service with the above yarnfile as below -
> {code:java}
> yarn app -launch simple-aa-1 simple_AA.json
> {code}
> Let's assume there are only 5 nodes in this cluster. Now, flex the above 
> service to one more container than the number of nodes (6 in my case).
> {code:java}
> yarn app -flex simple-aa-1 -component ping 6
> {code}
> Only 5 containers will be allocated and running for simple-aa-1. At this 
> point, flex it down to 5 containers -
> {code:java}
> yarn app -flex simple-aa-1 -component ping 5
> {code}
> This is what is seen in the serviceam log at this point -
> {noformat}
> 2018-05-03 20:17:38,469 [IPC Server handler 0 on 38124] INFO  
> service.ClientAMService - Flexing component ping to 5
> 2018-05-03 20:17:38,469 [Component  dispatcher] INFO  component.Component - 
> [FLEX DOWN COMPONENT ping]: scaling down from 6 to 5
> 2018-05-03 20:17:38,470 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE ping-4 : 
> container_1525297086734_0013_01_000006]: Flexed down by user, destroying.
> 2018-05-03 20:17:38,473 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT ping] Transitioned from FLEXING to STABLE on FLEX event.
> 2018-05-03 20:17:38,474 [pool-5-thread-8] INFO  
> registry.YarnRegistryViewForProviders - [COMPINSTANCE ping-4 : 
> container_1525297086734_0013_01_000006]: Deleting registry path 
> /users/root/services/yarn-service/simple-aa-1/components/ctr-1525297086734-0013-01-000006
> 2018-05-03 20:17:38,476 [Component  dispatcher] ERROR component.Component - 
> [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> CHECK_STABLE at STABLE
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>       at 
> org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>       at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,480 [Component  dispatcher] ERROR component.Component - 
> [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> CHECK_STABLE at STABLE
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>       at 
> org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>       at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,578 [pool-5-thread-8] INFO  instance.ComponentInstance - 
> [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleted 
> component instance dir: 
> hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-4
> 2018-05-03 20:17:39,268 [AMRM Callback Handler Thread] WARN  
> service.ServiceScheduler - Container container_1525297086734_0013_01_000006 
> Completed. No component instance exists. exitStatus=-100. 
> diagnostics=Container released by application 
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO  
> service.ServiceScheduler - 1 containers allocated. 
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO  
> service.ServiceScheduler - [COMPONENT ping]: remove 0 outstanding container 
> requests for allocateId 0
> 2018-05-03 20:17:40,274 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT ping]: container_1525297086734_0013_01_000007 allocated, num 
> pending component instances reduced to 0
> 2018-05-03 20:17:40,274 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT ping]: Assigned container_1525297086734_0013_01_000007 to 
> component instance ping-5 and launch on host 
> ctr-e138-1518143905142-280820-01-000008.example.site:25454 
> 2018-05-03 20:17:40,277 [pool-6-thread-6] INFO  provider.ProviderUtils - 
> [COMPINSTANCE ping-5 : container_1525297086734_0013_01_000007]: Creating dir 
> on hdfs: 
> hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-5
> 2018-05-03 20:17:40,316 [pool-6-thread-6] INFO  
> containerlaunch.ContainerLaunchService - launching container 
> container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,318 
> [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #5] INFO  
> impl.NMClientAsyncImpl - Processing Event EventType: START_CONTAINER for 
> Container container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,338 [Component  dispatcher] ERROR component.Component - 
> [COMPONENT ping]: Invalid event CONTAINER_STARTED at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> CONTAINER_STARTED at STABLE
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>       at 
> org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Status response shows that only 4 containers are running and the service is 
> not in STABLE state -
> {code:java}
> yarn app -status simple-aa-1
> {code}
> output -
> {code:java}
> {
>     "components": [
>         {
>             "configuration": {
>                 "env": {},
>                 "files": [],
>                 "properties": {}
>             },
>             "containers": [
>                 {
>                     "bare_host": 
> "ctr-e138-1518143905142-280820-01-000007.example.site",
>                     "component_instance_name": "ping-1",
>                     "hostname": 
> "ctr-e138-1518143905142-280820-01-000007.example.site",
>                     "id": "container_1525297086734_0013_01_000003",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378141535,
>                     "state": "READY"
>                 },
>                 {
>                     "bare_host": 
> "ctr-e138-1518143905142-280820-01-000006.example.site",
>                     "component_instance_name": "ping-0",
>                     "hostname": 
> "ctr-e138-1518143905142-280820-01-000006.example.site",
>                     "id": "container_1525297086734_0013_01_000002",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378141513,
>                     "state": "READY"
>                 },
>                 {
>                     "bare_host": 
> "ctr-e138-1518143905142-280820-01-000005.example.site",
>                     "component_instance_name": "ping-3",
>                     "hostname": 
> "ctr-e138-1518143905142-280820-01-000005.example.site",
>                     "id": "container_1525297086734_0013_01_000005",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378303429,
>                     "state": "READY"
>                 },
>                 {
>                     "bare_host": 
> "ctr-e138-1518143905142-280820-01-000004.example.site",
>                     "component_instance_name": "ping-2",
>                     "hostname": 
> "ctr-e138-1518143905142-280820-01-000004.example.site",
>                     "id": "container_1525297086734_0013_01_000004",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378303425,
>                     "state": "READY"
>                 }
>             ],
>             "dependencies": [],
>             "launch_command": "sleep 9000",
>             "name": "ping",
>             "number_of_containers": 5,
>             "placement_policy": {
>                 "constraints": [
>                     {
>                         "node_attributes": {},
>                         "node_partitions": [],
>                         "scope": "NODE",
>                         "target_tags": [
>                             "ping"
>                         ],
>                         "type": "ANTI_AFFINITY"
>                     }
>                 ]
>             },
>             "quicklinks": [],
>             "resource": {
>                 "additional": {},
>                 "cpus": 1,
>                 "memory": "256"
>             },
>             "run_privileged_container": false,
>             "state": "FLEXING"
>         }
>     ],
>     "configuration": {
>         "env": {},
>         "files": [],
>         "properties": {}
>     },
>     "id": "application_1525297086734_0013",
>     "kerberos_principal": {},
>     "lifetime": -1,
>     "name": "simple-aa-1",
>     "quicklinks": {},
>     "state": "STARTED",
>     "version": "1"
> }
> {code}


