[ https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477665#comment-16477665 ]
Eric Yang commented on YARN-8080:
---------------------------------

[~suma.shivaprasad]

{quote}
{quote}restart_policy=ON_FAILURE, and each component instance failed 3 times, and application goes into FINISHED state instead of FAILED state. Is this expected?{quote}
Can you please explain which part of code you are referring to? Or was it found during testing?
{quote}

This was found during testing and code review. The termination decision is based on:

{code}
nSucceeded + nFailed < comp.getComponentSpec().getNumberOfContainers()
{code}

Suppose a user specifies 2 containers and deliberately makes them fail. The first container fails and triggers one retry. The second container then fails. The total number of failed containers is now 3 (first container failed + second container failed + first container's retry failed), which is greater than the number of containers. This causes the service to terminate and report FINISHED.

This almost works for restart_policy=NEVER, but it should report FAILED if the number of failed containers is greater than 50% of the total containers. For restart_policy=ON_FAILURE, we want to terminate only when the number of succeeded containers equals getNumberOfContainers, and otherwise continue to retry. That way only successes count toward completion, and failures are retried on a best-effort basis. For restart_policy=ALWAYS, shouldTerminate should always be false.

Checkstyle still reports indentation and unused import problems. It would be good to automate the cleanup using IDE features.
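
To make the per-policy rules in this comment concrete, here is a minimal sketch in Java. It is not the code from any YARN-8080 patch: the class, enum, and method names (RestartPolicySketch, shouldTerminate, finalState) are hypothetical, and the counters are passed in as plain ints instead of being read from the component. It also assumes >= for the ON_FAILURE comparison as a defensive choice, where the comment says "equals".

```java
// Hypothetical sketch of the proposed rules; not the actual YARN implementation.
final class RestartPolicySketch {

    enum RestartPolicy { ALWAYS, NEVER, ON_FAILURE }

    enum FinalState { FINISHED, FAILED }

    /**
     * Decide whether a component should stop scheduling containers.
     * numContainers stands in for comp.getComponentSpec().getNumberOfContainers().
     */
    static boolean shouldTerminate(RestartPolicy policy, int nSucceeded,
                                   int nFailed, int numContainers) {
        switch (policy) {
            case ALWAYS:
                // Containers are always restarted, so the component never terminates.
                return false;
            case ON_FAILURE:
                // Only successes count toward completion; failures are retried.
                return nSucceeded >= numContainers;
            case NEVER:
                // No retries: done once every container has succeeded or failed.
                return nSucceeded + nFailed >= numContainers;
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }

    /** Report FAILED when more than 50% of the containers failed. */
    static FinalState finalState(int nFailed, int numContainers) {
        // Integer form of nFailed > 0.5 * numContainers, avoiding floating point.
        return 2 * nFailed > numContainers ? FinalState.FAILED : FinalState.FINISHED;
    }
}
```

With the scenario above (2 containers, 0 succeeded, 3 failures counted under ON_FAILURE), shouldTerminate returns false in this sketch, so the component keeps retrying instead of terminating and reporting FINISHED.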
> YARN native service should support component restart policy
> -----------------------------------------------------------
>
> Key: YARN-8080
> URL: https://issues.apache.org/jira/browse/YARN-8080
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Wangda Tan
> Assignee: Suma Shivaprasad
> Priority: Critical
> Attachments: YARN-8080.001.patch, YARN-8080.002.patch,
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch,
> YARN-8080.007.patch, YARN-8080.009.patch, YARN-8080.010.patch,
> YARN-8080.011.patch, YARN-8080.012.patch, YARN-8080.013.patch,
> YARN-8080.014.patch, YARN-8080.015.patch
>
>
> The existing native service assumes the service is long running and never
> finishes. Containers will be restarted even if exit code == 0.
> To support broader use cases, we need to allow users to specify a restart
> policy for each component. We propose the following policies:
> 1) Always: containers are always restarted by the framework regardless of
> container exit status. This is the existing/default behavior.
> 2) Never: do not restart containers in any case after a container finishes.
> This supports job-like workloads (for example, a Tensorflow training job). If
> a task exits with code == 0, we should not restart the task. This can be used
> by services that are not restartable/recoverable.
> 3) On-failure: similar to the above, but only restart tasks with exit code != 0.
> Behaviors after a component *instance* is finalized (Succeeded or Failed when
> restart_policy != ALWAYS):
> 1) For a single component with a single instance: complete the service.
> 2) For a single component with multiple instances: other running instances
> from the same component won't be affected by the finalized component
> instance. The service will be terminated once all instances are finalized.
> 3) For multiple components: the service will be terminated once all
> components are finalized.