[ 
https://issues.apache.org/jira/browse/SUBMARINE-952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394658#comment-17394658
 ] 

Yu-Tang Lin commented on SUBMARINE-952:
---------------------------------------

[~pingsutw]

Here's the response when a TFJob fails because it reached its backoffLimit. It looks like
the response doesn't include a specific error message for the earlier failures.

{code}

2021-08-06 17:32:43 INFO K8sSubmitter:212 - Upstream response JSON: 
{"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"creationTimestamp":"2021-08-06T09:29:55Z","generation":1.0,"labels":{"submarine-experiment-name":"experiment-e2e-test"},"name":"experiment-1628241467913-0001","namespace":"default","resourceVersion":"2620","selfLink":"/apis/kubeflow.org/v1/namespaces/default/tfjobs/experiment-1628241467913-0001","uid":"a6e2baf2-ed33-4f39-b6ea-7ddc17b80ff5"},"spec":{"backoffLimit":3.0,"tfReplicaSpecs":{"Ps":{"replicas":1.0,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py","--log_dir\u003d/train/log","--learning_rate\u003d0.01","--batch_size\u003d150"],"env":[{"name":"ENV_1","value":"ENV1"},{"name":"JOB_ID","value":"experiment-1628241467913-0001"},{"name":"SUBMARINE_TRACKING_URI","value":"mysql+pymysql://submarine:password@submarine-database:3306/submarine"},{"name":"SUBMARINE_TENSORBOARD_LOG_DIR","value":"/logs/mylog"},{"name":"CODE_PATH","value":"/code"}],"image":"apache/submarine:tf-mnist-with-summaries-1.0","name":"tensorflow","resources":{"limits":{"cpu":"1","memory":"256M","nvidia.com/gpu":"0"}},"volumeMounts":[{"mountPath":"/code","name":"code-dir"}]}],"initContainers":[{"env":[{"name":"GIT_SYNC_REPO","value":"https://github.com/apache/submarine.git"},{"name":"GIT_SYNC_ROOT","value":"/code"},{"name":"GIT_SYNC_DEST","value":"current"},{"name":"GIT_SYNC_ONE_TIME","value":"true"}],"image":"apache/submarine:git-sync-3.1.6","name":"code-localizer","volumeMounts":[{"mountPath":"/code","name":"code-dir"}]}],"volumes":[{"name":"volume","persistentVolumeClaim":{"claimName":"submarine-tensorboard-pvc"}},{"emptyDir":{},"name":"code-dir"}]}}},"Worker":{"replicas":1.0,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py","--log_dir\u003d/train/log","--learning_rate\u003d0.01","--batch_size\u003d150"],"env":[{"name":"ENV_1","value":"ENV1"},{"name":"JOB_ID","value":"experiment-1628241467913-0001"},{"name":"SUBMARINE_TRACKING_URI","value":"mysql+pymysql://submarine:password@submarine-database:3306/submarine"},{"name":"SUBMARINE_TENSORBOARD_LOG_DIR","value":"/logs/mylog"},{"name":"CODE_PATH","value":"/code"}],"image":"apache/submarine:tf-mnist-with-summaries-1.0","name":"tensorflow","resources":{"limits":{"cpu":"1","memory":"256M","nvidia.com/gpu":"0"}},"volumeMounts":[{"mountPath":"/code","name":"code-dir"}]}],"initContainers":[{"env":[{"name":"GIT_SYNC_REPO","value":"https://github.com/apache/submarine.git"},{"name":"GIT_SYNC_ROOT","value":"/code"},{"name":"GIT_SYNC_DEST","value":"current"},{"name":"GIT_SYNC_ONE_TIME","value":"true"}],"image":"apache/submarine:git-sync-3.1.6","name":"code-localizer","volumeMounts":[{"mountPath":"/code","name":"code-dir"}]}],"volumes":[{"name":"volume","persistentVolumeClaim":{"claimName":"submarine-tensorboard-pvc"}},{"emptyDir":{},"name":"code-dir"}]}}}}},"status":{"completionTime":"2021-08-06T09:32:03Z","conditions":[{"lastTransitionTime":"2021-08-06T09:29:55Z","lastUpdateTime":"2021-08-06T09:29:55Z","message":"TFJob experiment-1628241467913-0001 is created.","reason":"TFJobCreated","status":"True","type":"Created"},{"lastTransitionTime":"2021-08-06T09:31:32Z","lastUpdateTime":"2021-08-06T09:31:32Z","message":"TFJob experiment-1628241467913-0001 is running.","reason":"TFJobRunning","status":"False","type":"Running"},{"lastTransitionTime":"2021-08-06T09:32:03Z","lastUpdateTime":"2021-08-06T09:32:03Z","message":"TFJob experiment-1628241467913-0001 has failed because it has reached the specified backoff limit","reason":"TFJobFailed","status":"True","type":"Failed"}],"replicaStatuses":{"PS":{"active":1.0},"Worker":{"active":1.0}},"startTime":"2021-08-06T09:29:55Z"}}
{code}
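
For reference, here is a minimal Python sketch of how a client could pull the failure reason out of a response like the one above. The helper name `failure_summary` is hypothetical and not part of Submarine; it only reads the `status.conditions` fields visible in the JSON, and the sample is a trimmed copy of that status section.

```python
import json


def failure_summary(tfjob_json: str):
    """Return (reason, message) of the active Failed condition, or None.

    Hypothetical helper: it relies only on the status.conditions layout
    seen in the response above.
    """
    job = json.loads(tfjob_json)
    for cond in job.get("status", {}).get("conditions", []):
        # A condition is in effect only when its status is the string "True".
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return cond.get("reason"), cond.get("message")
    return None


# Trimmed copy of the status section from the response above.
sample = json.dumps({
    "status": {
        "conditions": [
            {"type": "Created", "status": "True", "reason": "TFJobCreated",
             "message": "TFJob experiment-1628241467913-0001 is created."},
            {"type": "Running", "status": "False", "reason": "TFJobRunning",
             "message": "TFJob experiment-1628241467913-0001 is running."},
            {"type": "Failed", "status": "True", "reason": "TFJobFailed",
             "message": "TFJob experiment-1628241467913-0001 has failed "
                        "because it has reached the specified backoff limit"},
        ]
    }
})

print(failure_summary(sample))
```

The only failure detail available this way is the `TFJobFailed` reason and the backoff-limit message; the cause of the individual restarts is not in the payload.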

> add the upper bound of retry counts for TFJob and PytorchJob
> ------------------------------------------------------------
>
>                 Key: SUBMARINE-952
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-952
>             Project: Apache Submarine
>          Issue Type: Sub-task
>            Reporter: Yu-Tang Lin
>            Assignee: Yu-Tang Lin
>            Priority: Minor
>              Labels: pull-request-available
>
> The TFJob and PytorchJob will retry when they hit a non-systematic 
> error (for example, OOM); if the same error keeps recurring, the retries 
> will never end.
> To prevent this, we would like to add a *backoffLimit* 
> configuration to both TFJob and PytorchJob.
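
As a rough illustration of the backoffLimit setting described in the issue, the sketch below sets `spec.backoffLimit` on a job manifest held as a Python dict. The field location is taken from the response JSON in this comment; the helper name `with_backoff_limit` and the minimal manifest are assumptions for illustration, not Submarine code.

```python
import copy


def with_backoff_limit(manifest: dict, limit: int) -> dict:
    """Return a copy of the manifest with spec.backoffLimit set.

    Hypothetical helper: assumes the limit lives directly under "spec",
    as in the response JSON in this comment.
    """
    if limit < 0:
        raise ValueError("backoffLimit must be non-negative")
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    updated.setdefault("spec", {})["backoffLimit"] = limit
    return updated


# Minimal illustrative manifest; replica specs omitted for brevity.
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "spec": {"tfReplicaSpecs": {}},
}

capped = with_backoff_limit(tfjob, 3)
print(capped["spec"]["backoffLimit"])
```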



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
