[ https://issues.apache.org/jira/browse/YUNIKORN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kinga Marton reassigned YUNIKORN-584:
-------------------------------------

    Assignee: Weiwei Yang

> App recovery is skipped when applicationID is not set in pods' label
> --------------------------------------------------------------------
>
>                 Key: YUNIKORN-584
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-584
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Chaoran Yu
>            Assignee: Weiwei Yang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.10
>
> There are cases when YK may think the cluster doesn't have enough
> resources even though that's not actually the case. This has happened to me
> twice: after YK had been running in a cluster for a few days, the
> [nodes endpoint|https://yunikorn.apache.org/docs/next/api/scheduler#nodes]
> showed that the cluster had only one node (i.e. the node that YK itself was
> running on), even though the K8s cluster has 10 nodes in total. If I then try
> to schedule a workload that requires more resources than are available on that
> node, YK leaves the pods pending with an event like the one below:
> {quote}Normal PodUnschedulable 41s yunikorn Task <namespace>/<pod> is
> pending for the requested resources become available{quote}
> because it's not aware that other nodes in the cluster have available
> resources.
> All of this can be fixed by just restarting YK (scaling the replica down to 0
> and then back up to 1), so a caching issue appears to be the cause, although
> the exact conditions that trigger this bug are not yet clear to me.
> My environment is on AWS EKS with K8s 1.17, if that matters.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
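For reference, the symptom can be checked and the workaround from the report applied with commands along these lines. This is a sketch only: the REST port/path follow the scheduler API docs linked in the issue for a YuniKorn 0.x install, and the `yunikorn-scheduler` deployment name and `yunikorn` namespace are assumptions from a typical deployment that may differ in your cluster.

```shell
# Port-forward the scheduler's REST service locally (port 9080 assumed),
# then count the nodes the scheduler currently knows about. If this number
# is smaller than the node count reported by `kubectl get nodes`, the
# scheduler's view of the cluster is stale.
kubectl -n yunikorn port-forward deployment/yunikorn-scheduler 9080:9080 &
curl -s http://localhost:9080/ws/v1/nodes | jq 'length'
kubectl get nodes --no-headers | wc -l

# Workaround described above: restart YK by scaling its replicas to 0
# and back to 1, forcing it to rebuild its cache on startup.
kubectl -n yunikorn scale deployment yunikorn-scheduler --replicas=0
kubectl -n yunikorn scale deployment yunikorn-scheduler --replicas=1
```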