[ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477665#comment-16477665
 ] 

Eric Yang edited comment on YARN-8080 at 5/16/18 4:19 PM:
----------------------------------------------------------

Thank you for the patch, [~suma.shivaprasad].

{quote}
Can you please explain which part of code you are referring to? Or was it found 
during testing?{quote}

This was found during both testing and code review.  The termination decision 
is based on the following check:

{code}
nSucceeded + nFailed < comp.getComponentSpec().getNumberOfContainers()
{code}

Suppose a user specifies 2 containers and makes them fail on purpose.  The 
first container fails and is retried once, the second container fails, and the 
retry of the first container fails as well.  The total number of failed 
containers is then 3 (first container + second container + retry of the first 
container), which is greater than the number of containers.  This triggers the 
program to terminate and report FINISHED.  This almost works for 
restart_policy=NEVER, except that it should report FAILED if the number of 
failed containers is greater than 50% of the total containers.
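
To make the arithmetic concrete, here is a minimal sketch of the scenario above. The class and method names are hypothetical and only mirror the quoted condition; this is not the code from the patch.

```java
// Hypothetical sketch: class and method names are illustrative, not from the patch.
final class TerminationCheckSketch {

    // Mirrors the quoted condition: true while fewer instances have finished
    // (succeeded or failed) than the component requested.
    static boolean belowContainerCount(long nSucceeded, long nFailed,
                                       long numberOfContainers) {
        return nSucceeded + nFailed < numberOfContainers;
    }

    public static void main(String[] args) {
        long numberOfContainers = 2;
        long nSucceeded = 0;
        long nFailed = 0;

        nFailed++; // first container fails, which triggers one retry
        System.out.println("after 1 failure:  "
            + belowContainerCount(nSucceeded, nFailed, numberOfContainers));

        nFailed++; // second container fails
        nFailed++; // the retry of the first container fails as well
        System.out.println("after 3 failures: "
            + belowContainerCount(nSucceeded, nFailed, numberOfContainers));
    }
}
```

The check holds after the first failure (1 < 2) and no longer holds once the retry is counted (3 < 2 is false), which is the point where the service terminates even though no container succeeded.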

For restart_policy=ON_FAILURE, we want to terminate only when the total number 
of succeeded containers equals getNumberOfContainers, and otherwise continue to 
retry.  This way only successes count toward completion, and failed containers 
are retried on a best-effort basis.

For restart_policy=ALWAYS, shouldTerminate should always be false.
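
The three cases above could be sketched roughly as follows. This is only an illustration of the suggestion, not the patch: the enum, the method signatures and the finalState helper are assumptions, and the 50% rule is applied as "failed containers strictly greater than half of the total".

```java
// Hypothetical sketch of the suggested per-policy termination logic.
// All names and signatures here are assumptions made for illustration.
final class ShouldTerminateSketch {

    enum RestartPolicy { ALWAYS, NEVER, ON_FAILURE }

    static boolean shouldTerminate(RestartPolicy policy, long nSucceeded,
                                   long nFailed, long numberOfContainers) {
        switch (policy) {
            case ALWAYS:
                // Long-running service: never terminate on its own.
                return false;
            case ON_FAILURE:
                // Only successes count; failed containers keep being retried.
                return nSucceeded >= numberOfContainers;
            case NEVER:
                // No retries: done once every container has finished.
                return nSucceeded + nFailed >= numberOfContainers;
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }

    // Suggested final state: FAILED when more than 50% of containers failed.
    static String finalState(long nFailed, long numberOfContainers) {
        return 2 * nFailed > numberOfContainers ? "FAILED" : "FINISHED";
    }

    public static void main(String[] args) {
        // ON_FAILURE with 2 containers and 3 failures: keep retrying.
        System.out.println(
            shouldTerminate(RestartPolicy.ON_FAILURE, 0, 3, 2));
        // NEVER with 2 containers, both failed: terminate and report FAILED.
        System.out.println(shouldTerminate(RestartPolicy.NEVER, 0, 2, 2)
            + " " + finalState(2, 2));
    }
}
```

With this shape, the 2-container scenario under ON_FAILURE does not terminate after three failures, because only the success count is compared against the number of containers.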

Checkstyle still reports indentation and unused import problems.  It would be 
good to automate the cleanup using IDE features.


was (Author: eyang):
[~suma.shivaprasad] {quote}
{quote}
restart_policy=ON_FAILURE, and each component instance failed 3 times, and 
application goes into FINISHED state instead of FAILED state. Is this 
expected?{quote}

Can you please explain which part of code you are referring to? Or was it found 
during testing?{quote}

> YARN native service should support component restart policy
> -----------------------------------------------------------
>
>                 Key: YARN-8080
>                 URL: https://issues.apache.org/jira/browse/YARN-8080
>             Project: Hadoop YARN
>          Issue Type: Task
>            Reporter: Wangda Tan
>            Assignee: Suma Shivaprasad
>            Priority: Critical
>         Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch, 
> YARN-8080.007.patch, YARN-8080.009.patch, YARN-8080.010.patch, 
> YARN-8080.011.patch, YARN-8080.012.patch, YARN-8080.013.patch, 
> YARN-8080.014.patch, YARN-8080.015.patch
>
>
> The existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support broader use cases, we need to allow users to specify the restart 
> policy of a component. We propose the following policies:
> 1) Always: containers are always restarted by the framework regardless of 
> container exit status. This is the existing/default behavior.
> 2) Never: do not restart containers in any case after the container finishes. 
> This supports job-like workloads (for example, a Tensorflow training job). If 
> a task exits with code == 0, we should not restart the task. This can be used 
> by services which are not restartable/recoverable.
> 3) On-failure: similar to the above, but only restart tasks with exit 
> code != 0. 
>
> Behaviors after a component *instance* finalizes (Succeeded or Failed when 
> restart_policy != ALWAYS): 
> 1) For a single component with a single instance: complete the service.
> 2) For a single component with multiple instances: other running instances of 
> the same component won't be affected by the finalized component instance. The 
> service will be terminated once all instances are finalized. 
> 3) For multiple components: the service will be terminated once all 
> components are finalized.
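
The three policies in the description above map an exit code to a restart decision. A minimal sketch of that mapping follows; the enum and method names are hypothetical and not taken from the YARN code base.

```java
// Hypothetical sketch of the per-container restart decision described in the
// issue. Names are assumptions for illustration only.
final class RestartDecisionSketch {

    enum RestartPolicy { ALWAYS, NEVER, ON_FAILURE }

    static boolean shouldRestart(RestartPolicy policy, int exitCode) {
        switch (policy) {
            case ALWAYS:
                // Existing/default behavior: restart regardless of exit status.
                return true;
            case NEVER:
                // Job-like workloads: never restart after the container finishes.
                return false;
            case ON_FAILURE:
                // Restart only when the container exited with a non-zero code.
                return exitCode != 0;
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        for (RestartPolicy policy : RestartPolicy.values()) {
            System.out.println(policy
                + " exit=0 -> " + shouldRestart(policy, 0)
                + ", exit=1 -> " + shouldRestart(policy, 1));
        }
    }
}
```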



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
