[ 
https://issues.apache.org/jira/browse/YUNIKORN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885474#comment-17885474
 ] 

Craig Condit commented on YUNIKORN-2884:
----------------------------------------

I'm not sure terminating the task is how we should handle this, as that makes 
debugging (via events) difficult. We should be looking into how to recover from 
this and re-schedule the pod.

 

> Task fail with post allocated but the pod will keep pending
> -----------------------------------------------------------
>
>                 Key: YUNIKORN-2884
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2884
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Major
>              Labels: pull-request-available
>
> We will fail task post allocated, but we don't update the pod to terminal 
> state.
> For example we bind pod volume failed post allocated, the pod will not go to 
> terminal state, it will fail:
> Pod event:
> {code:java}
> Events:
>   Type     Reason                 Age   From      Message
>   ----     ------                 ----  ----      -------
>   Normal   Scheduling             30s   yunikorn  dev-nnjxy/pod-btv0y is 
> queued and waiting for allocation
>   Normal   Scheduled              30s   yunikorn  Successfully assigned 
> dev-nnjxy/pod-btv0y to node yktest-worker
>   Warning  PodVolumesBindFailure  20s   yunikorn  bind volumes to pod failed, 
> name: dev-nnjxy/pod-btv0y, binding volumes: context deadline exceeded
>   Normal   TaskFailed             20s   yunikorn  Task dev-nnjxy/pod-btv0y is 
> failed{code}
> Pod pending not going to terminal state
> {code:java}
> 2024-09-20T11:22:27.601Z    INFO    shim.fsm    cache/task_state.go:381    
> Task state transition    {"app": "yunikorn-dev-03c96-autogen", "task": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", 
> "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"}
> 2024-09-20T11:22:37.606Z    DEBUG    shim.cache.task    cache/task.go:499    
> prepare to send release request    {"applicationID": 
> "yunikorn-dev-03c96-autogen", "taskID": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", 
> "allocationKey": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "task": "Allocated", 
> "terminationType": ""}
> 2024-09-20T11:22:37.606Z    DEBUG    core.scheduler    
> scheduler/scheduler.go:117    enqueued event    {"eventType": 
> "*rmevent.RMUpdateAllocationEvent", "event": 
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"task
>  
> completed","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]},"rmID":"mycluster"}},
>  "currentQueueSize": 0}
> 2024-09-20T11:22:37.606Z    ERROR    shim.cache.task    cache/task.go:475    
> task failed    {"appID": "yunikorn-dev-03c96-autogen", "taskID": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "reason": "bind volumes to pod 
> failed, name: dev-03c96/pod-bgg9h, binding volumes: context deadline 
> exceeded"}
> 2024-09-20T11:22:37.606Z    INFO    shim.fsm    cache/task_state.go:381    
> Task state transition    {"app": "yunikorn-dev-03c96-autogen", "task": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", 
> "source": "Allocated", "destination": "Failed", "event": "TaskFail"}
> 2024-09-20T11:22:37.606Z    INFO    core.scheduler.partition    
> scheduler/partition.go:1359    removing allocation from application    
> {"appID": "yunikorn-dev-03c96-autogen", "allocationKey": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "terminationType": "STOPPED_BY_RM"}
> 2024-09-20T11:22:37.606Z    DEBUG    core.scheduler.ugm    ugm/manager.go:132 
>    Decreasing resource usage    {"user": "kubernetes-admin", "queue path": 
> "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", "resource": 
> "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.606Z    DEBUG    core.scheduler.ugm    ugm/manager.go:152 
>    Decreasing resource usage for user    {"user": "kubernetes-admin", "queue 
> path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", 
> "group": "", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.606Z    DEBUG    core.scheduler.ugm    
> ugm/queue_tracker.go:132    Decreasing resource usage    {"queue path": 
> "root", "hierarchy": ["root", "dev-03c96"], "application": 
> "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.607Z    DEBUG    core.scheduler.ugm    
> ugm/queue_tracker.go:132    Decreasing resource usage    {"queue path": 
> "root.dev-03c96", "hierarchy": ["dev-03c96"], "application": 
> "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.607Z    DEBUG    core.scheduler.ugm    
> ugm/queue_tracker.go:159    Removed application from running applications    
> {"application": "yunikorn-dev-03c96-autogen", "queue path": "root.dev-03c96", 
> "queue name": "dev-03c96"}
> 2024-09-20T11:22:37.608Z    DEBUG    core.scheduler.ugm    
> ugm/queue_tracker.go:165    Successfully decreased resource usage    {"queue 
> path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", 
> "resource": "map[pods:1]", "total resource after decreasing": "map[]", "total 
> applications after decreasing": 0}
> 2024-09-20T11:22:37.608Z    DEBUG    core.scheduler.ugm    
> ugm/queue_tracker.go:159    Removed application from running applications    
> {"application": "yunikorn-dev-03c96-autogen", "queue path": "root", "queue 
> name": "root"}
> 2024-09-20T11:22:37.608Z    DEBUG    core.scheduler.ugm    
> ugm/queue_tracker.go:165    Successfully decreased resource usage    {"queue 
> path": "root", "application": "yunikorn-dev-03c96-autogen", "resource": 
> "map[pods:1]", "total resource after decreasing": "map[]", "total 
> applications after decreasing": 0}
> 2024-09-20T11:22:37.608Z    DEBUG    core.scheduler.application    
> objects/application.go:336    Application state timer initiated    {"appID": 
> "yunikorn-dev-03c96-autogen", "state": "Completing", "timeout": "30s"}
> 2024-09-20T11:22:37.608Z    INFO    core.scheduler.fsm    
> objects/application_state.go:147    Application state transition    {"appID": 
> "yunikorn-dev-03c96-autogen", "source": "Running", "destination": 
> "Completing", "event": "completeApplication"}
> 2024-09-20T11:22:37.608Z    DEBUG    core.rmproxy    rmproxy/rmproxy.go:60    
> enqueue event    {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completing","stateTransitionTimestamp":1726831357608331511,"message":"completeApplication"}]},
>  "currentQueueSize": 0}
> 2024-09-20T11:22:37.608Z    INFO    core.scheduler.application    
> objects/application.go:615    ask removed successfully from application    
> {"appID": "yunikorn-dev-03c96-autogen", "ask": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "pendingDelta": "nil resource"}
> 2024-09-20T11:22:37.608Z    DEBUG    core.rmproxy    rmproxy/rmproxy.go:60    
> enqueue event    {"eventType": "*rmevent.RMReleaseAllocationEvent", "event": 
> {"RmID":"mycluster","ReleasedAllocations":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"allocation
>  remove as per RM 
> request","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]}, 
> "currentQueueSize": 1}
> 2024-09-20T11:22:37.608Z    DEBUG    shim.rmcallback    
> cache/scheduler_callback.go:108    UpdateApplication callback received    
> {"UpdateApplicationResponse": 
> "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completing\" 
> stateTransitionTimestamp:1726831357608331511 
> message:\"completeApplication\"}"}
> 2024-09-20T11:22:37.608Z    DEBUG    shim.rmcallback    
> cache/scheduler_callback.go:137    status update callback received    
> {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completing"}
> 2024-09-20T11:22:37.608Z    DEBUG    shim.rmcallback    
> cache/scheduler_callback.go:47    UpdateAllocation callback received    
> {"UpdateAllocationResponse": "released:{partitionName:\"[mycluster]default\" 
> applicationID:\"yunikorn-dev-03c96-autogen\" terminationType:STOPPED_BY_RM 
> message:\"allocation remove as per RM request\" 
> allocationKey:\"6f3dd7fa-72b4-40cf-a700-43e51394a06b\"}"}
> 2024-09-20T11:22:38.605Z    INFO    shim.cache.application    
> cache/application.go:239    task removed    {"appID": 
> "yunikorn-dev-03c96-autogen", "taskID": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b"}
> 2024-09-20T11:23:07.607Z    DEBUG    core.scheduler.application    
> objects/application.go:352    Application state: auto progress    
> {"applicationID": "yunikorn-dev-03c96-autogen", "state": "Completing"}
> 2024-09-20T11:23:07.607Z    DEBUG    core.scheduler.application    
> objects/application.go:384    Application state timer cleared    {"appID": 
> "yunikorn-dev-03c96-autogen", "state": "Completing"}
> 2024-09-20T11:23:07.607Z    DEBUG    core.scheduler.application    
> objects/application.go:336    Application state timer initiated    {"appID": 
> "yunikorn-dev-03c96-autogen", "state": "Completed", "timeout": "72h0m0s"}
> 2024-09-20T11:23:07.607Z    INFO    core.scheduler.fsm    
> objects/application_state.go:147    Application state transition    {"appID": 
> "yunikorn-dev-03c96-autogen", "source": "Completing", "destination": 
> "Completed", "event": "completeApplication"}
> 2024-09-20T11:23:07.607Z    DEBUG    core.rmproxy    rmproxy/rmproxy.go:60    
> enqueue event    {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completed","stateTransitionTimestamp":1726831387607633721,"message":"completeApplication"}]},
>  "currentQueueSize": 0}
> 2024-09-20T11:23:07.607Z    DEBUG    shim.rmcallback    
> cache/scheduler_callback.go:108    UpdateApplication callback received    
> {"UpdateApplicationResponse": 
> "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completed\" 
> stateTransitionTimestamp:1726831387607633721 
> message:\"completeApplication\"}"}
> 2024-09-20T11:23:07.607Z    DEBUG    shim.rmcallback    
> cache/scheduler_callback.go:137    status update callback received    
> {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completed"}
> 2024-09-20T11:23:07.607Z    INFO    core.scheduler.queue    
> objects/queue.go:830    Application completed and removed from queue    
> {"queueName": "root.dev-03c96", "applicationID": "yunikorn-dev-03c96-autogen"}
> 2024-09-20T11:23:07.607Z    INFO    core.scheduler.partition    
> scheduler/partition.go:1539    Removing terminated application from the 
> application list    {"appID": "yunikorn-dev-03c96-autogen", "app status": 
> "Completed"}
> 2024-09-20T11:23:07.607Z    INFO    core.scheduler.application.usage    
> objects/application_summary.go:60    YK_APP_SUMMARY: {ApplicationID: 
> yunikorn-dev-03c96-autogen, SubmissionTime: 1726831345581, StartTime: 
> 1726831347597, FinishTime: 1726831387607, User: kubernetes-admin, Queue: 
> root.dev-03c96, State: Completed, RmID: mycluster, ResourceUsage: 
> TrackedResource{UNKNOWN:pods=10}, PreemptedResource: TrackedResource{}, 
> PlaceholderResource: TrackedResource{}} {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org

Reply via email to