Github user serranom commented on the issue:

    https://github.com/apache/twill/pull/23

Gotcha. Here it is, cleaned up to reflect what I actually did:

* Each runnable can have a configured maximum number of retries. If it is not set, retries are unlimited, as before.
* Added withMaxTries(runnableName, int) to TwillPreparer.
* Added withMaxTries(runnableName, int) to YarnTwillPreparer. This stores a map from runnableName to maxRetries.
* This map becomes part of TwillRuntimeSpecification and the RuntimeSpecification interface, and is added to TwillRuntimeSpecificationCodec.
* ApplicationMasterService.initRunningContainers is updated to pass a map from runnable name to max retries.
* Updated RunningContainers so that it keeps a count of retries per runnable and uses it in handleCompleted() to decide whether to retry. Every instance is the same as any other, so the limit scales with the instance count: if I start 10 instances of a Runnable with a max retry count of 3, the total number of retries allowed is 30. Each instance gets (on average) 3 tries. Since the instances are interchangeable, there is no concept of a discrete instance being retried.
* Updated logging so that nothing special is logged if no max was set; otherwise it logs the number of retries left and when they have been exhausted.
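To make the per-runnable accounting concrete, here is a rough sketch of the idea, not the code from the pull request. The class `RetryBudget` and its methods `tryRetry` and `retriesLeft` are hypothetical names made up for illustration; the real logic lives in RunningContainers.handleCompleted() and may differ in detail. It assumes the allowed total is the configured max multiplied by the instance count, and that a runnable with no configured max is retried without limit.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the actual RunningContainers implementation. */
class RetryBudget {
  // Configured max retries per runnable name. A missing key means "unlimited".
  private final Map<String, Integer> maxRetries;
  // Retries consumed so far per runnable, pooled across all of its instances.
  private final Map<String, Integer> retriesUsed = new HashMap<>();

  RetryBudget(Map<String, Integer> maxRetries) {
    this.maxRetries = new HashMap<>(maxRetries);
  }

  /**
   * Called when a container of the given runnable completes abnormally.
   * Returns true (and consumes one retry) if the runnable may be restarted.
   */
  boolean tryRetry(String runnableName, int instanceCount) {
    Integer max = maxRetries.get(runnableName);
    if (max == null) {
      return true; // no limit configured: retry as before
    }
    int total = max * instanceCount; // e.g. 3 retries * 10 instances = 30
    int used = retriesUsed.getOrDefault(runnableName, 0);
    if (used >= total) {
      return false; // shared budget exhausted
    }
    retriesUsed.put(runnableName, used + 1);
    return true;
  }

  /** Remaining retries for the runnable, or -1 if no limit is configured. */
  int retriesLeft(String runnableName, int instanceCount) {
    Integer max = maxRetries.get(runnableName);
    if (max == null) {
      return -1;
    }
    return max * instanceCount - retriesUsed.getOrDefault(runnableName, 0);
  }
}
```

With a max of 3 and 10 instances, `tryRetry` returns true for the first 30 failures of that runnable and false afterwards, regardless of which instance failed. One instance could in principle use more than 3 of the pooled retries, which is the "on average" caveat above.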