[ 
https://issues.apache.org/jira/browse/TWILL-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849407#comment-15849407
 ] 

ASF GitHub Bot commented on TWILL-211:
--------------------------------------

Github user poornachandra commented on the issue:

    https://github.com/apache/twill/pull/29
  
    LGTM from me too


> Retries of failed runnable instances may result in unsatisfiable provisioning 
> requests
> --------------------------------------------------------------------------------------
>
>                 Key: TWILL-211
>                 URL: https://issues.apache.org/jira/browse/TWILL-211
>             Project: Apache Twill
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0
>            Reporter: Martin Serrano
>            Assignee: Martin Serrano
>            Priority: Critical
>             Fix For: 0.10.0
>
>
> While investigating the intermittent test failures for TWILL-181, I 
> discovered this bug.  The relevant code (starting on line 703 of 
> ApplicationMasterService) is:
> {code}
> if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
>     provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) {
>   provisioning.poll();
> }
> {code}
> There is a case where instances fail, but not simultaneously, so that the 
> retries for those instances are spread over two invocations of 
> `ApplicationMasterService.handleCompleted`. The retries then become part of 
> separate `RunnableContainerRequests` and are provisioned separately. 
> Because the code above does not anticipate this case, the first 
> provisionRequest never appears to be satisfied, is never polled, and the 
> total can never be met.
> The first provisionRequest does not appear to be satisfied because the 
> expected container count never equals the running container count.  The code 
> as-is expects every request either to be an 
> `ALLOCATE_ONE_INSTANCE_AT_A_TIME` or to cover all instances.  In the case of 
> retries, requests may come in all at once or in other patterns which result 
> in multiple provision requests.
> When retrying instances, the code should set the type to 
> `ALLOCATE_ONE_INSTANCE_AT_A_TIME` to reflect the situation.
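
The stall described in the issue above can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions, not the Twill implementation: `ProvisioningSketch`, `ProvisionRequest`, `simulateRetries` and the `DEFAULT` type are hypothetical names introduced for illustration, and each retry is modelled as a separate request for a single instance. Only the head-of-queue check mirrors the snippet quoted in the description.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ProvisioningSketch {

  // ALLOCATE_ONE_INSTANCE_AT_A_TIME is the type named in the issue;
  // DEFAULT is an assumed placeholder meaning "any other type".
  enum Type { DEFAULT, ALLOCATE_ONE_INSTANCE_AT_A_TIME }

  // Hypothetical stand-in for one provision request.
  static final class ProvisionRequest {
    final Type type;
    int remaining; // instances this request still has to launch

    ProvisionRequest(Type type, int instances) {
      this.type = type;
      this.remaining = instances;
    }
  }

  /**
   * Queues one single-instance request per retry, launches containers from the
   * head of the queue, and applies the check quoted in the issue after each launch.
   * Returns the number of requests left in the queue (0 means nothing is stuck).
   */
  static int simulateRetries(Type retryType, int expected, int running, int retries) {
    Queue<ProvisionRequest> provisioning = new ArrayDeque<>();
    for (int i = 0; i < retries; i++) {
      provisioning.add(new ProvisionRequest(retryType, 1));
    }
    while (!provisioning.isEmpty()) {
      ProvisionRequest head = provisioning.peek();
      if (head.remaining == 0) {
        // The head has launched everything it asked for but was never polled,
        // so every request queued behind it is blocked.
        break;
      }
      head.remaining--;
      running++;
      // Same condition as in the quoted snippet.
      if (expected == running || head.type == Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME) {
        provisioning.poll();
      }
    }
    return provisioning.size();
  }

  public static void main(String[] args) {
    // 3 instances expected, 1 still running, 2 retries queued as separate requests.
    System.out.println(simulateRetries(Type.DEFAULT, 3, 1, 2));                         // prints 2
    System.out.println(simulateRetries(Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME, 3, 1, 2)); // prints 0
  }
}
```

In this model, the first retry request brings the running count to 2 of 3, so under the assumed `DEFAULT` type it is never polled and the second request is never reached. Marking retry requests as `ALLOCATE_ONE_INSTANCE_AT_A_TIME`, as the description proposes, lets each request be polled as soon as its container starts.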



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
