[ https://issues.apache.org/jira/browse/SUBMARINE-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on SUBMARINE-952 started by Yu-Tang Lin.
---------------------------------------------
> add the upper bound of retry counts for TFJob and PytorchJob
> -------------------------------------------------------------
>
>                 Key: SUBMARINE-952
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-952
>             Project: Apache Submarine
>          Issue Type: Sub-task
>            Reporter: Yu-Tang Lin
>            Assignee: Yu-Tang Lin
>            Priority: Minor
>
> TFJob and PytorchJob retry when they hit a non-systematic error
> (for example, OOM); if the same error keeps recurring, the retries
> never end.
> To prevent this, we would like to add a *backoffLimit* configuration
> to both TFJob and PytorchJob.
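
The behavior requested above can be sketched as a bounded retry loop. This is only an illustration, not Submarine's or the training operators' actual code: `RetryableError`, `run_with_backoff_limit`, and `run_once` are hypothetical names, and treating `backoff_limit` as "retries allowed after the first attempt" is an assumption about the intended semantics.

```python
from typing import Callable


class RetryableError(Exception):
    """Hypothetical stand-in for a non-systematic failure such as OOM."""


def run_with_backoff_limit(run_once: Callable[[], None], backoff_limit: int) -> str:
    """Run a job, retrying at most `backoff_limit` times after the first attempt.

    Assumption: the limit counts retries, so the job runs at most
    backoff_limit + 1 times in total.
    """
    retries = 0
    while True:
        try:
            run_once()
            return "Succeeded"
        except RetryableError:
            if retries >= backoff_limit:
                # Retry budget exhausted: give up instead of looping forever.
                return "Failed"
            retries += 1
```

Without the `retries >= backoff_limit` check, a job that hits the same OOM on every attempt would loop indefinitely, which is the case the issue describes.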