Peter Bacsko created YUNIKORN-2520:
--------------------------------------

             Summary: PVC errors in AssumePod() is not handled properly
                 Key: YUNIKORN-2520
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2520
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Peter Bacsko

When a volume operation in {{AssumePod()}} fails, the allocation on the core side is not removed. Although we check the result of UpdateAllocation, the error handling only logs the error:

{noformat}
if err := callback.UpdateAllocation(response); err != nil {
    rmp.handleUpdateResponseError(rmID, err)
}
...
func (rmp *RMProxy) handleUpdateResponseError(rmID string, err error) {
    log.Log(log.RMProxy).Error("failed to handle response",
        zap.String("rmID", rmID),
        zap.Error(err))
}
{noformat}

I suggest moving the volume-related code to {{Task.postTaskAllocated}}. In that case, the task will transition to the "Failed" state and the allocationID will be available, so we can release both the ask and the allocation:

{noformat}
func (task *Task) releaseAllocation() {
    ...
    var releaseRequest *si.AllocationRequest
    s := TaskStates()
    switch task.GetTaskState() {
    case s.New, s.Pending, s.Scheduling, s.Rejected:
        releaseRequest = common.CreateReleaseAskRequestForTask(
            task.applicationID, task.taskID, task.application.partition)  // <-- release ask + allocation if possible
    default:
        if task.allocationID == "" {
            ... log error ...
            return
        }
        releaseRequest = common.CreateReleaseAllocationRequestForTask(
            task.applicationID, task.taskID, task.allocationID,
            task.application.partition, task.terminationType)
    }
    ...
{noformat}
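
To illustrate why the task state matters here, the following is a minimal, self-contained sketch of the decision made by the switch quoted above. It is not the shim's code: {{TaskState}}, {{ReleaseKind}} and {{chooseRelease}} are hypothetical names invented for this example, and the state list is only the subset mentioned in this issue. It shows that a task failing after allocation, with a known allocationID, takes the branch that releases the allocation, while a task with no allocationID can only report an error.

```go
package main

import "fmt"

// TaskState is a hypothetical stand-in for the shim's task states.
type TaskState string

const (
	New        TaskState = "New"
	Pending    TaskState = "Pending"
	Scheduling TaskState = "Scheduling"
	Rejected   TaskState = "Rejected"
	Failed     TaskState = "Failed"
)

// ReleaseKind names which kind of release request the sketch would build.
// These values are illustrative only and do not exist in YuniKorn.
type ReleaseKind string

const (
	ReleaseAsk        ReleaseKind = "release-ask"
	ReleaseAllocation ReleaseKind = "release-allocation"
)

// chooseRelease mirrors the structure of the quoted switch:
// early states release the ask; any other state needs an allocationID
// to release the allocation, otherwise nothing can be released.
func chooseRelease(state TaskState, allocationID string) (ReleaseKind, error) {
	switch state {
	case New, Pending, Scheduling, Rejected:
		return ReleaseAsk, nil
	default:
		if allocationID == "" {
			return "", fmt.Errorf("task in state %q has no allocationID", state)
		}
		return ReleaseAllocation, nil
	}
}

func main() {
	// A task that failed after allocation and knows its allocationID.
	kind, err := chooseRelease(Failed, "alloc-1")
	fmt.Println(kind, err)

	// A task that failed without an allocationID cannot release anything.
	kind, err = chooseRelease(Failed, "")
	fmt.Println(kind, err)

	// A task that is still being scheduled releases its ask.
	kind, err = chooseRelease(Scheduling, "")
	fmt.Println(kind, err)
}
```

Under these assumptions, the benefit of handling the volume error in {{Task.postTaskAllocated}} is that the failure happens at a point where the allocationID is already set, so the release path has what it needs instead of falling into the error branch.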