[ https://issues.apache.org/jira/browse/TWILL-213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martin Serrano updated TWILL-213:
---------------------------------
    Description: 
As seen in the test development for TWILL-181, if the number of instances for a 
container is increased before the {{ApplicationMasterService}} has observed the 
original request as being satisfied, the instance increase and any subsequent 
retries will be blocked.  This is because in {{launchRunnable}}:

{code}
    TwillContainerLauncher launcher = new TwillContainerLauncher(
        twillSpec.getRunnables().get(runnableName), processLauncher.getContainerInfo(), launchContext,
        ZKClients.namespace(zkClient, getZKNamespace(runnableName)),
        containerCount, jvmOpts, reservedMemory, getSecureStoreLocation());

    runningContainers.start(runnableName, processLauncher.getContainerInfo(), launcher);

    // Need to call complete to workaround bug in YARN AMRMClient
    if (provisionRequest.containerAcquired()) {
      amClient.completeContainerRequest(provisionRequest.getRequestId());
    }

    /*
     * The provisionRequest will either contain a single container (ALLOCATE_ONE_INSTANCE_AT_A_TIME), or all the
     * containers to satisfy the expectedContainers count. In the later case, the provision request is complete once
     * all the containers have run at which point we poll() to remove the provisioning request.
     */
    if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
        provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) {
      provisioning.poll();
    }
{code}

There is a race condition. The sequence is:

* *Thread A*: {{runningContainers.start}} is called and 2 instances are started.
* *Thread B*: The runnable from {{createSetInstanceRunnable}} executes, sees 
that the 2 instances are started, and updates the expected count to 3.
* *Thread A*: Reaches the {{if}} check comparing {{expectedContainers}} (3) to 
{{runningContainers.count}} (2). Since this fails, {{poll}} is not called and 
this provision request is never treated as satisfied.

Subsequent calls that try to provision the 3rd container are blocked because 
the first provision request still appears to be unsatisfied.
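
The interleaving above can be replayed deterministically in a single thread. 
The following is only a minimal sketch: {{ProvisionRaceReplay}}, {{State}} and 
{{replay}} are hypothetical names, the queue and counters are plain stand-ins 
for {{provisioning}}, {{expectedContainers}} and {{runningContainers}}, and the 
{{ALLOCATE_ONE_INSTANCE_AT_A_TIME}} branch is left out. It is not Twill code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Single-threaded replay of the interleaving described in this issue.
 * All names are hypothetical stand-ins; none of this is Twill code.
 */
class ProvisionRaceReplay {

  /** Stand-in for the state shared between the two threads. */
  static final class State {
    final Queue<String> provisioning = new ArrayDeque<>();
    int expected;
    int running;
  }

  /**
   * Replays the launch path. When raceWithIncrease is true, the "Thread B"
   * step runs between the container start and the completion check.
   * Returns the number of provision requests still left in the queue.
   */
  static int replay(boolean raceWithIncrease) {
    State s = new State();
    s.expected = 2;
    s.provisioning.add("initial request for 2 instances");

    // Thread A: the start call brings up both instances.
    s.running = 2;

    if (raceWithIncrease) {
      // Thread B: the set-instance runnable raises the expected count.
      s.expected = 3;
    }

    // Thread A: the completion check (expected == running) decides
    // whether the head request is removed.
    if (s.expected == s.running) {
      s.provisioning.poll();
    }
    return s.provisioning.size();
  }

  public static void main(String[] args) {
    System.out.println("no race, requests left: " + replay(false));
    System.out.println("race, requests left: " + replay(true));
  }
}
```

Without the interleaved update the queue ends up empty; with it, the initial 
request stays at the head of the queue even though both of its containers are 
running.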

The {{MaxRetriesTestRun.maxRetriesWithIncreasedInstances}} method can be used 
to reproduce this case intermittently by replacing the {{allRunning.await}} 
check with one that counts down a latch from an {{onRunning}} callback, as 
{{EchoServerTestRun}} does.
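
The latch-based wait can be sketched roughly as follows. This is an 
illustration under assumptions: {{FakeController}} is a hypothetical stand-in 
whose {{onRunning(Runnable, Executor)}} method only mimics the general shape 
of a "running" callback, and {{awaitRunning}} is an invented helper. The 
actual test code in {{EchoServerTestRun}} may differ.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of waiting on a latch that is counted down from a "running"
 * callback. FakeController is a hypothetical stand-in, not a Twill class.
 */
class OnRunningLatchSketch {

  /** Hypothetical controller that invokes its callback once it is "running". */
  static final class FakeController {
    void onRunning(Runnable callback, Executor executor) {
      // A real controller would fire this on a state change;
      // this stand-in fires immediately.
      executor.execute(callback);
    }
  }

  /** Returns true if the running callback fired within the timeout. */
  static boolean awaitRunning(FakeController controller, long timeoutSeconds) {
    CountDownLatch running = new CountDownLatch(1);
    Executor sameThread = Runnable::run; // run the callback on the calling thread
    controller.onRunning(running::countDown, sameThread);
    try {
      return running.await(timeoutSeconds, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("running observed: " + awaitRunning(new FakeController(), 5));
  }
}
```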

  was:
As seen in the test development for TWILL-181, if the number of instances for a 
container is increased before the {{ApplicationMasterService}} has observed the 
original request as being satisfied, the instance increase and any subsequent 
retries will be blocked.  This is because in {{launchRunnable}}:

{code}
    if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
        provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) {
      provisioning.poll();
    }
{code}

We are comparing the expected containers to the running count to decide if 
{{provisioning.poll()}} should be called. If a new instance request has been 
made, the expected containers will have been updated and the running count 
will never match it. The {{MaxRetriesTestRun.maxRetriesWithIncreasedInstances}} 
method can be used to reproduce this case intermittently by replacing the 
{{allRunning.await}} check with one that counts down a latch from an 
{{onRunning}} callback, as {{EchoServerTestRun}} does.


> Increase of instances while starting up may lead to ignored retries and 
> instance increases
> ------------------------------------------------------------------------------------------
>
>                 Key: TWILL-213
>                 URL: https://issues.apache.org/jira/browse/TWILL-213
>             Project: Apache Twill
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 0.9.0
>            Reporter: Martin Serrano



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
