[
https://issues.apache.org/jira/browse/SUBMARINE-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
cdmikechen reassigned SUBMARINE-1376:
-------------------------------------
Assignee: cdmikechen
> XGBoost experiment pods will be deleted so that submarine can not get logs
> --------------------------------------------------------------------------
>
> Key: SUBMARINE-1376
> URL: https://issues.apache.org/jira/browse/SUBMARINE-1376
> Project: Apache Submarine
> Issue Type: Bug
> Components: experiment
> Reporter: cdmikechen
> Assignee: cdmikechen
> Priority: Blocker
> Attachments: submarine-xgboost-pods.jpg
>
>
> After I submitted an XGBoost task with the following JSON, Submarine was
> able to monitor the task's status correctly.
> POST http://127.0.0.1:32080/api/v1/experiment
> {code:json}
> {
>   "meta": {
>     "name": "xgboost-example",
>     "tags": [],
>     "framework": "Xgboost",
>     "cmd": "python /opt/mlkube/main.py --job_type=Train --xgboost_parameter=objective:multi:softprob,num_class:3 --n_estimators=10 --learning_rate=0.1 --model_path=/tmp/xgboost-model --model_storage_type=local",
>     "envVars": {}
>   },
>   "environment": {
>     "image": "docker.io/merlintang/xgboost-dist-iris:1.1"
>   },
>   "spec": {
>     "Worker": {
>       "replicas": 2,
>       "resources": "cpu=0.5,nvidia.com/gpu=0,memory=512M"
>     },
>     "Master": {
>       "replicas": 1,
>       "resources": "cpu=0.5,nvidia.com/gpu=0,memory=512M"
>     }
>   }
> }
> {code}
> However, once the task finished, the training-operator deleted the pods.
> As a result, Submarine could no longer determine the names of the pods that
> had run, or retrieve each pod's logs.
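A note on the deletion described above: the training-operator job spec has a `runPolicy.cleanPodPolicy` field (values `All`, `Running`, `None`) that governs which pods are cleaned up once a job finishes. Whether Submarine should set it, and whether that is the eventual fix for this issue, is an assumption on my part and not something the report states. A minimal Python sketch of a manifest that asks the operator to keep finished pods; the helper name `build_xgboost_job`, the container name and the command wrapping are all hypothetical:

```python
import json

VALID_POLICIES = ("All", "Running", "None")


def build_xgboost_job(name, namespace, image, cmd, clean_pod_policy="None"):
    """Return an XGBoostJob manifest (kubeflow.org/v1) as a plain dict.

    clean_pod_policy="None" asks the operator to leave pods in place after
    the job finishes so their logs stay readable. Treating this as the
    remedy for the issue is an assumption, not a confirmed fix.
    """
    if clean_pod_policy not in VALID_POLICIES:
        raise ValueError(f"unknown cleanPodPolicy: {clean_pod_policy!r}")

    def replica(count):
        return {
            "replicas": count,
            "restartPolicy": "Never",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Container name and shell wrapping are assumptions.
                            "name": "xgboost",
                            "image": image,
                            "command": ["/bin/sh", "-c", cmd],
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "XGBoostJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "runPolicy": {"cleanPodPolicy": clean_pod_policy},
            # Replica counts mirror the request in the report: 1 master, 2 workers.
            "xgbReplicaSpecs": {"Master": replica(1), "Worker": replica(2)},
        },
    }


job = build_xgboost_job(
    name="xgboost-example",
    namespace="submarine",
    image="docker.io/merlintang/xgboost-dist-iris:1.1",
    cmd="python /opt/mlkube/main.py --job_type=Train",
)
print(json.dumps(job["spec"]["runPolicy"]))
```

The resulting dict could be submitted with any Kubernetes client. Note that keeping pods around means something else has to delete them eventually.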
> I checked the training-operator (1.4.0) logs and found the following:
> {code}
> time="2023-04-01T09:26:31Z" level=info msg="xgboostJob
> experiment-1680334381873-0006 is created."
> time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: worker-0"
> job=submarine.experiment-1680334381873-0006 replica-type=worker
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:31Z" level=info msg="Controller
> experiment-1680334381873-0006 created pod
> experiment-1680334381873-0006-worker-0" job=.experiment-1680334381873-0006
> pod=.experiment-1680334381873-0006-worker-0 uid=
> time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: worker-1"
> job=submarine.experiment-1680334381873-0006 replica-type=worker
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.270Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
> "reason": "SuccessfulCreatePod", "message": "Created pod:
> experiment-1680334381873-0006-worker-0"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller
> experiment-1680334381873-0006 created pod
> experiment-1680334381873-0006-worker-1" job=.experiment-1680334381873-0006
> pod=.experiment-1680334381873-0006-worker-1 uid=
> time="2023-04-01T09:26:31Z" level=info msg="need to create new service:
> Worker-0" job=submarine.experiment-1680334381873-0006 replica-type=worker
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.307Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
> "reason": "SuccessfulCreatePod", "message": "Created pod:
> experiment-1680334381873-0006-worker-1"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller
> experiment-1680334381873-0006 created service
> experiment-1680334381873-0006-worker-0"
> time="2023-04-01T09:26:31Z" level=info msg="need to create new service:
> Worker-1" job=submarine.experiment-1680334381873-0006 replica-type=worker
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.344Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
> "reason": "SuccessfulCreateService", "message": "Created service:
> experiment-1680334381873-0006-worker-0"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller
> experiment-1680334381873-0006 created service
> experiment-1680334381873-0006-worker-1"
> time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: master-0"
> job=submarine.experiment-1680334381873-0006 replica-type=master
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.410Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
> "reason": "SuccessfulCreateService", "message": "Created service:
> experiment-1680334381873-0006-worker-1"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller
> experiment-1680334381873-0006 created pod
> experiment-1680334381873-0006-master-0" job=.experiment-1680334381873-0006
> pod=.experiment-1680334381873-0006-master-0 uid=
> time="2023-04-01T09:26:31Z" level=info msg="need to create new service:
> Master-0" job=submarine.experiment-1680334381873-0006 replica-type=master
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.462Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
> "reason": "SuccessfulCreatePod", "message": "Created pod:
> experiment-1680334381873-0006-master-0"}
> time="2023-04-01T09:26:31Z" level=info msg="Controller
> experiment-1680334381873-0006 created service
> experiment-1680334381873-0006-master-0"
> time="2023-04-01T09:26:31Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.487Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"},
> "reason": "SuccessfulCreateService", "message": "Created service:
> experiment-1680334381873-0006-master-0"}
> time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:31Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:31Z" level=error msg="Operation cannot be fulfilled on
> xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has
> been modified; please apply your changes to the latest version and try
> againfailed to update XGBoost Job conditions in the API server"
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:26:31.538Z ERROR controllers.XGBoostJob Reconcile
> XGBoost Job error {"xgboostjob":
> "submarine/experiment-1680334381873-0006", "error": "Operation cannot be
> fulfilled on xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the
> object has been modified; please apply your changes to the latest version and
> try again"}
> 2023-04-01T09:26:31.538Z ERROR
> controller-runtime.manager.controller.xgboostjob-controller Reconciler
> error {"name": "experiment-1680334381873-0006", "namespace":
> "submarine", "error": "Operation cannot be fulfilled on
> xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has
> been modified; please apply your changes to the latest version and try again"}
> time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:31Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:31Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=0, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=2, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1,
> running=1, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2,
> running=2, succeeded=0 , failed=0"
> time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is running."
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:27:04Z" level=info msg="Ignoring inactive pod
> submarine/experiment-1680334381873-0006-master-0 in state Succeeded, deletion
> time <nil>"
> time="2023-04-01T09:27:04Z" level=info msg="Pod:
> submarine.experiment-1680334381873-0006-master-0 exited with code 0"
> job=submarine.experiment-1680334381873-0006 replica-type=master
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:27:04Z" level=info
> msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=0,
> running=0, succeeded=1 , failed=0"
> time="2023-04-01T09:27:04Z" level=info msg="XGBoostJob
> experiment-1680334381873-0006 is successfully completed."
> 2023-04-01T09:27:04.010Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40623"},
> "reason": "ExitedWithCode", "message": "Pod:
> submarine.experiment-1680334381873-0006-master-0 exited with code 0"}
> 2023-04-01T09:27:04.010Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40623"},
> "reason": "XGBoostJobSucceeded", "message": "XGBoostJob
> experiment-1680334381873-0006 is successfully completed."}
> time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:27:04Z" level=info msg="Controller
> experiment-1680334381873-0006 deleting pod
> submarine/experiment-1680334381873-0006-worker-1"
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:27:04.067Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
> "reason": "SuccessfulDeletePod", "message": "Deleted pod:
> experiment-1680334381873-0006-worker-1"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller
> experiment-1680334381873-0006 deleting service
> submarine/experiment-1680334381873-0006-worker-1"
> 2023-04-01T09:27:04.113Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
> "reason": "SuccessfulDeleteService", "message": "Deleted service:
> experiment-1680334381873-0006-worker-1"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller
> experiment-1680334381873-0006 deleting pod
> submarine/experiment-1680334381873-0006-worker-0"
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:27:04.145Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
> "reason": "SuccessfulDeletePod", "message": "Deleted pod:
> experiment-1680334381873-0006-worker-0"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller
> experiment-1680334381873-0006 deleting service
> submarine/experiment-1680334381873-0006-worker-0"
> 2023-04-01T09:27:04.162Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
> "reason": "SuccessfulDeleteService", "message": "Deleted service:
> experiment-1680334381873-0006-worker-0"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller
> experiment-1680334381873-0006 deleting pod
> submarine/experiment-1680334381873-0006-master-0"
> job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> 2023-04-01T09:27:04.175Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
> "reason": "SuccessfulDeletePod", "message": "Deleted pod:
> experiment-1680334381873-0006-master-0"}
> time="2023-04-01T09:27:04Z" level=info msg="Controller
> experiment-1680334381873-0006 deleting service
> submarine/experiment-1680334381873-0006-master-0"
> 2023-04-01T09:27:04.185Z DEBUG controller-runtime.manager.events
> Normal {"object":
> {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"},
> "reason": "SuccessfulDeleteService", "message": "Deleted service:
> experiment-1680334381873-0006-master-0"}
> time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:27:04Z" level=info msg="pod
> submarine/experiment-1680334381873-0006-worker-1 is terminating, skip
> deleting" job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:27:13Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> time="2023-04-01T09:27:13Z" level=info msg="pod
> submarine/experiment-1680334381873-0006-worker-1 is terminating, skip
> deleting" job=submarine.experiment-1680334381873-0006
> uid=20673c7b-e336-4ab0-b584-7453bc6b3234
> time="2023-04-01T09:27:13Z" level=info msg="Reconciling for job
> experiment-1680334381873-0006"
> {code}
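For anyone reading a log like the one above, it can help to pull out just the pods the operator deleted. This is a rough Python sketch, not part of Submarine or the training-operator. The function name `deleted_pods` is made up, and the pattern assumes the `deleting pod <namespace>/<name>` wording seen in this log, which may differ in other operator versions. It also undoes the `> ` quoting and line wrapping added by the mail.

```python
import re

# Matches the operator's 'deleting pod <namespace>/<name>' message.
# It does not match 'skip deleting' or the 'Deleted pod:' event text.
DELETING_POD = re.compile(r"deleting pod\s+([\w./-]+)")


def deleted_pods(log_text):
    """Return pod names the operator reported deleting, in first-seen order."""
    # Drop the mail quote prefix and join wrapped lines into one string.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    pods = []
    for match in DELETING_POD.finditer(flat):
        pod = match.group(1)
        if pod not in pods:
            pods.append(pod)
    return pods


sample = '''> time="2023-04-01T09:27:04Z" level=info msg="Controller
> experiment-1680334381873-0006 deleting pod
> submarine/experiment-1680334381873-0006-worker-1"
> job=submarine.experiment-1680334381873-0006
> time="2023-04-01T09:27:04Z" level=info msg="pod
> submarine/experiment-1680334381873-0006-worker-1 is terminating, skip
> deleting" job=submarine.experiment-1680334381873-0006
> time="2023-04-01T09:27:04Z" level=info msg="Controller
> experiment-1680334381873-0006 deleting pod
> submarine/experiment-1680334381873-0006-master-0"
'''

for pod in deleted_pods(sample):
    print(pod)
```

Run against the full log, this should list worker-1, worker-0 and master-0, i.e. every pod of the job including the master that had already succeeded.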