[
https://issues.apache.org/jira/browse/YUNIKORN-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955697#comment-17955697
]
Paul Santa Clara commented on YUNIKORN-2804:
--------------------------------------------
One particularly degenerate case occurs during pod binding() when the etcd
leader's apply index falls too far behind its commit index. The leader then
rejects the proposal:
[https://github.com/etcd-io/etcd/pull/5927/files]
The k8s api-server currently maps all etcd errors to
{{http.StatusInternalServerError}} instead of instructing the k8s go-client to
back off and retry (see
[https://github.com/kubernetes/kubernetes/issues/112152]). The end result is
that YuniKorn transitions the task to 'failed'. The pod remains in a
'Pending' state indefinitely and binding() is never attempted again. If said
pod happens to be a Spark driver pod, then the controller responsible for
creating it (often the Kubeflow SparkOperator) will not understand that the
Spark job has failed, simply believing that it has yet to be scheduled.
The Helm project has taken to issuing explicit retries while waiting for an
upstream fix from K8s: [https://github.com/helm/helm/pull/11401/files]
> [Umbrella] Rethink general retry policy for post allocation failed task
> -----------------------------------------------------------------------
>
> Key: YUNIKORN-2804
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2804
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler, scheduler-interface, shim - kubernetes
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
>
> We are adding a retry for failed volume binding in:
> [https://github.com/apache/yunikorn-k8shim/pull/890]
> *Update: we closed the above PR after discussion.*
> We need to do the follow-up work instead of the retry above.
> As discussed here, we want a general retry policy as a future
> improvement.
> [https://github.com/apache/yunikorn-k8shim/pull/890#issuecomment-2288658926]
>