[jira] [Updated] (YUNIKORN-2520) PVC errors in AssumePod() is not handled properly

Peter Bacsko (Jira) Wed, 27 Mar 2024 10:35:05 -0700


     [ 
https://issues.apache.org/jira/browse/YUNIKORN-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Peter Bacsko updated YUNIKORN-2520:
-----------------------------------
    Description: 
When a volume operation fails in {{Context.AssumePod()}}, the allocation on 
the core side is not removed.

Although we check the result of {{UpdateAllocation}}, the error handling only 
logs the error:
{noformat}
if err := callback.UpdateAllocation(response); err != nil {
    rmp.handleUpdateResponseError(rmID, err)
}
...

func (rmp *RMProxy) handleUpdateResponseError(rmID string, err error) {
    log.Log(log.RMProxy).Error("failed to handle response",
        zap.String("rmID", rmID),
        zap.Error(err))
}
{noformat}
I suggest moving the volume-related code to {{Task.postTaskAllocated()}}. In 
that case, the task will transition to the "Failed" state and the allocationID 
will be available, so we can release both the ask and the allocation:
{noformat}
func (task *Task) releaseAllocation() {
    ...
    var releaseRequest *si.AllocationRequest
    s := TaskStates()
    switch task.GetTaskState() {
    case s.New, s.Pending, s.Scheduling, s.Rejected:
        // release ask + allocation if possible
        releaseRequest = common.CreateReleaseAskRequestForTask(
            task.applicationID, task.taskID, task.application.partition)
    default:
        if task.allocationID == "" {
            ... log error ...
            return
        }
        releaseRequest = common.CreateReleaseAllocationRequestForTask(
            task.applicationID, task.taskID, task.allocationID,
            task.application.partition, task.terminationType)
    }
    ...
{noformat}
 


> PVC errors in AssumePod() is not handled properly
> -------------------------------------------------
>
>                 Key: YUNIKORN-2520
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2520
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
