[ 
https://issues.apache.org/jira/browse/TWILL-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856669#comment-15856669
 ] 

Martin Serrano commented on TWILL-213:
--------------------------------------

I've updated the description with what I think is happening.

> Increase of instances while starting up may lead to ignored retries and 
> instance increases
> ------------------------------------------------------------------------------------------
>
>                 Key: TWILL-213
>                 URL: https://issues.apache.org/jira/browse/TWILL-213
>             Project: Apache Twill
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 0.9.0
>            Reporter: Martin Serrano
>
> As seen in the test development for TWILL-181, if the number of instances for 
> a container is increased before the {{ApplicationMasterService}} has observed 
> the original request as being satisfied, the instance increase and any 
> subsequent retries will be blocked.  This is because in {{launchRunnable}}:
> {code}
>       TwillContainerLauncher launcher = new TwillContainerLauncher(
>           twillSpec.getRunnables().get(runnableName), processLauncher.getContainerInfo(), launchContext,
>           ZKClients.namespace(zkClient, getZKNamespace(runnableName)),
>           containerCount, jvmOpts, reservedMemory, getSecureStoreLocation());
>       runningContainers.start(runnableName, processLauncher.getContainerInfo(), launcher);
>       // Need to call complete to workaround bug in YARN AMRMClient
>       if (provisionRequest.containerAcquired()) {
>         amClient.completeContainerRequest(provisionRequest.getRequestId());
>       }
>       /*
>        * The provisionRequest will either contain a single container (ALLOCATE_ONE_INSTANCE_AT_A_TIME), or all the
>        * containers to satisfy the expectedContainers count. In the latter case, the provision request is complete once
>        * all the containers have run at which point we poll() to remove the provisioning request.
>        */
>       if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
>           provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) {
>         provisioning.poll();
>       }
> {code}
> There is a race condition.  The sequence:
> * *Thread A*: {{runningContainers.start}} is called and 2 instances are 
> started
> * *Thread B*: The runnable from {{createSetInstanceRunnable}} executes, sees 
> the 2 instances are started and updates the expected count to 3.
> * *Thread A*: Reaches the {{if}} check comparing {{expectedContainers}} (3) to 
> {{runningContainers.count}} (2).  Since they are not equal, {{poll}} is not 
> called and the provision request is left looking unsatisfied.
> Subsequent requests, including the one for the 3rd container, are then blocked 
> because the first provision request still appears to be unsatisfied.
> The {{MaxRetriesTestRun.maxRetriesWithIncreasedInstances}} method can be used 
> to reproduce this case intermittently by replacing the {{allRunning.await}} 
> check with one that counts down a latch in {{onRunning}}, as 
> {{EchoServerTestRun}} does.
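
The interleaving described above can be sketched as a small deterministic simulation. This is only an illustration under simplifying assumptions: the class {{ProvisionRaceSketch}}, the method {{simulate}}, and the plain counters and queue are hypothetical stand-ins, not Twill classes, and the two threads are replayed in a fixed order on one thread so the outcome is repeatable.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ProvisionRaceSketch {

    /**
     * Replays the sequence from the description with simplified state.
     * Returns true if the provision request was removed from the queue.
     * All names here are hypothetical stand-ins, not Twill APIs.
     */
    static boolean simulate(boolean increaseBeforeCheck) {
        int expected = 2;  // stands in for expectedContainers.getExpected(runnableName)
        int running = 0;   // stands in for runningContainers.count(runnableName)
        Queue<String> provisioning = new ArrayDeque<>();
        provisioning.add("request-for-2-instances");

        // Thread A: runningContainers.start(...) has started both instances.
        running = 2;

        // Thread B: the set-instances runnable raises the expected count.
        if (increaseBeforeCheck) {
            expected = 3;
        }

        // Thread A: the check at the end of launchRunnable.
        if (expected == running) {
            provisioning.poll();
        }
        return provisioning.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("no interleaving, request removed: " + simulate(false));
        System.out.println("interleaved, request removed: " + simulate(true));
    }
}
```

In this sketch the first call removes the request and the second leaves it at the head of the queue, which is the stuck state the description attributes to the race.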



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
