[ https://issues.apache.org/jira/browse/YUNIKORN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885474#comment-17885474 ]
Craig Condit commented on YUNIKORN-2884: ---------------------------------------- I'm not sure terminating the task is how we should handle this, as that makes debugging (via events) difficult. We should be looking into how to recover from this and re-schedule the pod. > Task fail with post allocated but the pod will keep pending > ----------------------------------------------------------- > > Key: YUNIKORN-2884 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2884 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes > Reporter: Qi Zhu > Assignee: Qi Zhu > Priority: Major > Labels: pull-request-available > > We will fail task post allocated, but we don't update the pod to terminal > state. > For example we bind pod volume failed post allocated, the pod will not go to > terminal state, it will fail: > Pod event: > {code:java} > Events: > Type Reason Age From Message > ---- ------ ---- ---- ------- > Normal Scheduling 30s yunikorn dev-nnjxy/pod-btv0y is > queued and waiting for allocation > Normal Scheduled 30s yunikorn Successfully assigned > dev-nnjxy/pod-btv0y to node yktest-worker > Warning PodVolumesBindFailure 20s yunikorn bind volumes to pod failed, > name: dev-nnjxy/pod-btv0y, binding volumes: context deadline exceeded > Normal TaskFailed 20s yunikorn Task dev-nnjxy/pod-btv0y is > failed{code} > Pod pending not going to terminal state > {code:java} > 2024-09-20T11:22:27.601Z INFO shim.fsm cache/task_state.go:381 > Task state transition {"app": "yunikorn-dev-03c96-autogen", "task": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", > "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"} > 2024-09-20T11:22:37.606Z DEBUG shim.cache.task cache/task.go:499 > prepare to send release request {"applicationID": > "yunikorn-dev-03c96-autogen", "taskID": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", > "allocationKey": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "task": "Allocated", > "terminationType": ""} > 2024-09-20T11:22:37.606Z DEBUG core.scheduler > scheduler/scheduler.go:117 enqueued event {"eventType": > "*rmevent.RMUpdateAllocationEvent", "event": > {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"task > > completed","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]},"rmID":"mycluster"}}, > "currentQueueSize": 0} > 2024-09-20T11:22:37.606Z ERROR shim.cache.task cache/task.go:475 > task failed {"appID": "yunikorn-dev-03c96-autogen", "taskID": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "reason": "bind volumes to pod > failed, name: dev-03c96/pod-bgg9h, binding volumes: context deadline > exceeded"} > 2024-09-20T11:22:37.606Z INFO shim.fsm cache/task_state.go:381 > Task state transition {"app": "yunikorn-dev-03c96-autogen", "task": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", > "source": "Allocated", "destination": "Failed", "event": "TaskFail"} > 2024-09-20T11:22:37.606Z INFO core.scheduler.partition > scheduler/partition.go:1359 removing allocation from application > {"appID": "yunikorn-dev-03c96-autogen", "allocationKey": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "terminationType": "STOPPED_BY_RM"} > 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm ugm/manager.go:132 > Decreasing resource usage {"user": "kubernetes-admin", "queue path": > "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", "resource": > "map[pods:1]", "removeApp": true} > 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm ugm/manager.go:152 > Decreasing resource usage for user {"user": "kubernetes-admin", "queue > path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", > "group": "", "resource": "map[pods:1]", "removeApp": true} > 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm > ugm/queue_tracker.go:132 Decreasing resource usage {"queue path": > "root", "hierarchy": ["root", "dev-03c96"], "application": > "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true} > 2024-09-20T11:22:37.607Z DEBUG core.scheduler.ugm > ugm/queue_tracker.go:132 Decreasing resource usage {"queue path": > "root.dev-03c96", "hierarchy": ["dev-03c96"], "application": > "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true} > 2024-09-20T11:22:37.607Z DEBUG core.scheduler.ugm > ugm/queue_tracker.go:159 Removed application from running applications > {"application": "yunikorn-dev-03c96-autogen", "queue path": "root.dev-03c96", > "queue name": "dev-03c96"} > 2024-09-20T11:22:37.608Z DEBUG core.scheduler.ugm > ugm/queue_tracker.go:165 Successfully decreased resource usage {"queue > path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", > "resource": "map[pods:1]", "total resource after decreasing": "map[]", "total > applications after decreasing": 0} > 2024-09-20T11:22:37.608Z DEBUG core.scheduler.ugm > ugm/queue_tracker.go:159 Removed application from running applications > {"application": "yunikorn-dev-03c96-autogen", "queue path": "root", "queue > name": "root"} > 2024-09-20T11:22:37.608Z DEBUG core.scheduler.ugm > ugm/queue_tracker.go:165 Successfully decreased resource usage {"queue > path": "root", "application": "yunikorn-dev-03c96-autogen", "resource": > "map[pods:1]", "total resource after decreasing": "map[]", "total > applications after decreasing": 0} > 2024-09-20T11:22:37.608Z DEBUG core.scheduler.application > objects/application.go:336 Application state timer initiated {"appID": > "yunikorn-dev-03c96-autogen", "state": "Completing", "timeout": "30s"} > 2024-09-20T11:22:37.608Z INFO core.scheduler.fsm > objects/application_state.go:147 Application state transition {"appID": > "yunikorn-dev-03c96-autogen", "source": "Running", "destination": > "Completing", "event": "completeApplication"} > 2024-09-20T11:22:37.608Z DEBUG core.rmproxy rmproxy/rmproxy.go:60 > enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": > {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completing","stateTransitionTimestamp":1726831357608331511,"message":"completeApplication"}]}, > "currentQueueSize": 0} > 2024-09-20T11:22:37.608Z INFO core.scheduler.application > objects/application.go:615 ask removed successfully from application > {"appID": "yunikorn-dev-03c96-autogen", "ask": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "pendingDelta": "nil resource"} > 2024-09-20T11:22:37.608Z DEBUG core.rmproxy rmproxy/rmproxy.go:60 > enqueue event {"eventType": "*rmevent.RMReleaseAllocationEvent", "event": > {"RmID":"mycluster","ReleasedAllocations":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"allocation > remove as per RM > request","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]}, > "currentQueueSize": 1} > 2024-09-20T11:22:37.608Z DEBUG shim.rmcallback > cache/scheduler_callback.go:108 UpdateApplication callback received > {"UpdateApplicationResponse": > "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completing\" > stateTransitionTimestamp:1726831357608331511 > message:\"completeApplication\"}"} > 2024-09-20T11:22:37.608Z DEBUG shim.rmcallback > cache/scheduler_callback.go:137 status update callback received > {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completing"} > 2024-09-20T11:22:37.608Z DEBUG shim.rmcallback > cache/scheduler_callback.go:47 UpdateAllocation callback received > {"UpdateAllocationResponse": "released:{partitionName:\"[mycluster]default\" > applicationID:\"yunikorn-dev-03c96-autogen\" terminationType:STOPPED_BY_RM > message:\"allocation remove as per RM request\" > allocationKey:\"6f3dd7fa-72b4-40cf-a700-43e51394a06b\"}"} > 2024-09-20T11:22:38.605Z INFO shim.cache.application > cache/application.go:239 task removed {"appID": > "yunikorn-dev-03c96-autogen", "taskID": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b"} > 2024-09-20T11:23:07.607Z DEBUG core.scheduler.application > objects/application.go:352 Application state: auto progress > {"applicationID": "yunikorn-dev-03c96-autogen", "state": "Completing"} > 2024-09-20T11:23:07.607Z DEBUG core.scheduler.application > objects/application.go:384 Application state timer cleared {"appID": > "yunikorn-dev-03c96-autogen", "state": "Completing"} > 2024-09-20T11:23:07.607Z DEBUG core.scheduler.application > objects/application.go:336 Application state timer initiated {"appID": > "yunikorn-dev-03c96-autogen", "state": "Completed", "timeout": "72h0m0s"} > 2024-09-20T11:23:07.607Z INFO core.scheduler.fsm > objects/application_state.go:147 Application state transition {"appID": > "yunikorn-dev-03c96-autogen", "source": "Completing", "destination": > "Completed", "event": "completeApplication"} > 2024-09-20T11:23:07.607Z DEBUG core.rmproxy rmproxy/rmproxy.go:60 > enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": > {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completed","stateTransitionTimestamp":1726831387607633721,"message":"completeApplication"}]}, > "currentQueueSize": 0} > 2024-09-20T11:23:07.607Z DEBUG shim.rmcallback > cache/scheduler_callback.go:108 UpdateApplication callback received > {"UpdateApplicationResponse": > "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completed\" > stateTransitionTimestamp:1726831387607633721 > message:\"completeApplication\"}"} > 2024-09-20T11:23:07.607Z DEBUG shim.rmcallback > cache/scheduler_callback.go:137 status update callback received > {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completed"} > 2024-09-20T11:23:07.607Z INFO core.scheduler.queue > objects/queue.go:830 Application completed and removed from queue > {"queueName": "root.dev-03c96", "applicationID": "yunikorn-dev-03c96-autogen"} > 2024-09-20T11:23:07.607Z INFO core.scheduler.partition > scheduler/partition.go:1539 Removing terminated application from the > application list {"appID": "yunikorn-dev-03c96-autogen", "app status": > "Completed"} > 2024-09-20T11:23:07.607Z INFO core.scheduler.application.usage > objects/application_summary.go:60 YK_APP_SUMMARY: {ApplicationID: > yunikorn-dev-03c96-autogen, SubmissionTime: 1726831345581, StartTime: > 1726831347597, FinishTime: 1726831387607, User: kubernetes-admin, Queue: > root.dev-03c96, State: Completed, RmID: mycluster, ResourceUsage: > TrackedResource{UNKNOWN:pods=10}, PreemptedResource: TrackedResource{}, > PlaceholderResource: TrackedResource{}} {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org