[ https://issues.apache.org/jira/browse/TWILL-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849407#comment-15849407 ]
ASF GitHub Bot commented on TWILL-211:
--------------------------------------

Github user poornachandra commented on the issue:

    https://github.com/apache/twill/pull/29

    LGTM from me too

> Retries of failed runnable instances may result in unsatisfiable provisioning requests
> --------------------------------------------------------------------------------------
>
>                 Key: TWILL-211
>                 URL: https://issues.apache.org/jira/browse/TWILL-211
>             Project: Apache Twill
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0
>            Reporter: Martin Serrano
>            Assignee: Martin Serrano
>            Priority: Critical
>             Fix For: 0.10.0
>
>
> In my investigation into the intermittent failures of tests for TWILL-181 I
> discovered this bug in the following code (starting on line 703 of
> ApplicationMasterService):
> {code}
> if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
>     provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) {
>   provisioning.poll();
> }
> {code}
> There is a case, when instances are failing (but not simultaneously), where the
> retries for the instances will be spread over two invocations of
> `ApplicationMasterService.handleCompleted`. This means they will be part of
> separate `RunnableContainerRequests` and thus will be provisioned separately.
> But because the code above does not anticipate this case, the first
> provisionRequest will never appear to be satisfied, will never be polled, and
> the total can never be met.
> The first provisionRequest does not appear to be satisfied because the
> expected containers will never equal the running containers. The code as-is
> expects every request either to be of type `ALLOCATE_ONE_INSTANCE_AT_A_TIME`
> or to cover all instances. In the case of retries, requests may come in all at
> once or in other patterns which result in multiple provision requests.
> When retrying instances, the code should set the type to
> `ALLOCATE_ONE_INSTANCE_AT_A_TIME` to reflect the situation.
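
For readers following the thread, here is a minimal, self-contained sketch of the stuck-queue behaviour described in the issue. It is not the Twill implementation: the class `ProvisioningSketch`, the method names, the `ALL_AT_ONCE` constant, and the one-container-per-step loop are all assumptions made for illustration. Only the shape of the poll condition is taken from the snippet quoted in the report.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ProvisioningSketch {

    // Hypothetical stand-in for AllocationSpecification.Type.
    // ALL_AT_ONCE is an invented name for "any type other than one-at-a-time".
    public enum Type { ALLOCATE_ONE_INSTANCE_AT_A_TIME, ALL_AT_ONCE }

    // Mirrors the condition quoted in the report: the head request is removed
    // only if the runnable is fully provisioned or the request is one-at-a-time.
    public static boolean shouldPoll(int expected, int running, Type headType) {
        return expected == running || headType == Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME;
    }

    // Assumed model: each failed instance produced its own single-instance
    // retry request (as when failures are spread over separate completion
    // callbacks). One container is launched for the head request per step.
    // Returns the number of requests left stuck in the queue.
    public static int simulate(int expected, int failed, Type retryType) {
        int running = expected - failed;
        Deque<Type> provisioning = new ArrayDeque<>();
        for (int i = 0; i < failed; i++) {
            provisioning.add(retryType);
        }
        while (!provisioning.isEmpty()) {
            running++; // the head request received its single container
            if (!shouldPoll(expected, running, provisioning.peek())) {
                break; // head never looks satisfied, so later requests are never reached
            }
            provisioning.poll();
        }
        return provisioning.size();
    }

    public static void main(String[] args) {
        // Two separate retry requests, not marked one-at-a-time: queue is stuck.
        System.out.println(simulate(2, 2, Type.ALL_AT_ONCE)); // prints 2
        // Same retries marked one-at-a-time, as the report suggests: queue drains.
        System.out.println(simulate(2, 2, Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)); // prints 0
    }
}
```

In this simplified model a single failed instance still drains the queue, because the running count reaches the expected count after its one retry. The problem only shows up once retries are split across more than one request, which matches the intermittent nature of the failures mentioned in the report.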