[jira] [Commented] (YUNIKORN-2784) Scheduler stuck

Wilfred Spiegelenburg (Jira) Wed, 18 Sep 2024 18:27:04 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882842#comment-17882842
 ]


Wilfred Spiegelenburg commented on YUNIKORN-2784:
-------------------------------------------------

Correct, there is no instant way to move. That is why we are looking at the 
change in YUNIKORN-2791. It will expose all pods inside YuniKorn, even the ones 
not scheduled by YuniKorn. Instead of those pods showing up only as usage on 
the node, we will see the pod itself and can look at possible preemption. This 
is the case for all pod types, not just daemon sets.

You have a limit range set on your cluster. The pods might be tiny when you 
create them, but they are not when you schedule them. The pod asks for 3GB of 
memory as each container is given a minimum of 1GB. Check the pod for details: 
an annotation on the pod records that the container resources were changed. The 
limit range will be applied to every pod in the cluster. That means a pod with 
3 containers each asking for 100MB of memory, 300MB total for the pod, needs 
3GB at scheduling time once the limit range has been applied. A 10-fold 
increase. If that happens for all your pods, you waste a huge amount of 
resources. It could also explain why the node is seen as "full" when you expect 
it to be empty.
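
To make the arithmetic concrete, here is a minimal sketch of the inflation 
described above. It only models "each container is raised to a 1GB floor" as 
stated in this comment; it is not the Kubernetes limit range admission logic 
and not YuniKorn code. The function name, the decimal MB units and the sample 
values are assumptions chosen for illustration.

```python
# Hypothetical helper for illustration only; not part of YuniKorn or Kubernetes.
MB_PER_GB = 1000  # decimal units, matching the 300MB -> 3GB figures above


def effective_pod_memory_mb(container_requests_mb, floor_mb):
    """Sum the per-container requests after raising each one to the floor."""
    return sum(max(request, floor_mb) for request in container_requests_mb)


requested = [100, 100, 100]      # three containers asking for 100MB each
declared_total = sum(requested)  # what the pod spec appears to ask for
scheduled_total = effective_pod_memory_mb(requested, 1 * MB_PER_GB)
increase = scheduled_total / declared_total

print(f"declared: {declared_total}MB, scheduled: {scheduled_total}MB, "
      f"increase: {increase:g}x")
```

Running the same calculation over every pod on a node gives a rough idea of 
how much of the node's memory is taken up by the floor rather than by what the 
pods originally asked for.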

> Scheduler stuck
> ---------------
>
>                 Key: YUNIKORN-2784
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2784
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Dmitry
>            Priority: Major
>         Attachments: Screenshot 2024-08-02 at 1.16.30 PM.png, Screenshot 
> 2024-08-02 at 1.20.23 PM.png, dumps.tgz, logs
>
>
> Shortly after switching to YuniKorn, a bunch of tiny pods get stuck pending 
> (screenshot 1). So do all the other ones, but these are the most visible and 
> should be running 100%.
> After restarting the scheduler, all get scheduled immediately (screenshot 2).
> Attaching the output of `/ws/v1/stack`, `/ws/v1/fullstatedump` and 
> `/debug/pprof/goroutine?debug=2`, along with logs from the scheduler.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
