Yu-Tang Lin created SUBMARINE-952:
-------------------------------------
Summary: add the upper bound of retry counts for TFJob and
PytorchJob
Key: SUBMARINE-952
URL: https://issues.apache.org/jira/browse/SUBMARINE-952
Project: Apache Submarine
Issue Type: Sub-task
Reporter: Yu-Tang Lin
Assignee: Yu-Tang Lin
TFJob and PytorchJob retry when they hit a non-systematic error (for
example, OOM). If the same error keeps recurring, the job retries
indefinitely.
To prevent this, we would like to add a *backoffLimit* configuration to
both TFJob and PytorchJob, which caps the number of retries.
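
As a rough illustration of the intended behavior (not the actual operator or Submarine code), the sketch below shows how a retry cap ends a job that keeps failing. The function name run_with_backoff_limit and the always_oom stand-in are hypothetical, and it assumes backoffLimit counts retries after the first attempt.

```python
from typing import Callable, Tuple


def run_with_backoff_limit(attempt: Callable[[], bool],
                           backoff_limit: int) -> Tuple[bool, int]:
    """Run `attempt` once, then retry at most `backoff_limit` times.

    Returns (succeeded, retries_used). Hypothetical helper for illustration;
    it assumes the limit counts retries, not the initial attempt.
    """
    retries = 0
    while True:
        if attempt():
            return True, retries
        if retries >= backoff_limit:
            # The cap is reached: give up instead of retrying forever.
            return False, retries
        retries += 1


def always_oom() -> bool:
    """Stand-in for a job that fails with the same error every time."""
    return False


succeeded, retries = run_with_backoff_limit(always_oom, backoff_limit=3)
print(succeeded, retries)
```

Without the cap, the loop above would never exit for a job that fails the same way on every attempt, which is the situation this issue describes.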
--
This message was sent by Atlassian Jira
(v8.3.4#803005)