[ 
https://issues.apache.org/jira/browse/YUNIKORN-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805339#comment-17805339
 ] 

Peter Bacsko edited comment on YUNIKORN-2322 at 1/11/24 1:05 AM:
-----------------------------------------------------------------

Ah, I guess I know it. It might be YUNIKORN-1900. Are you seeing tons of "queue 
update failed unexpectedly" messages?

If so, it has nothing to do with reserved allocation: the internal state of the 
scheduler gets corrupted, which is why a restart is needed.

Even at 30 ms latency, the rate is still about 33 pods/s, so it's not a 
performance issue.


was (Author: pbacsko):
Ah, I guess I know it. It might be YUNIKORN-1900. Are you seeing tons of "queue 
update failed unexpectedly" messages?

> Investigate YuniKorn stuck when scheduling latency is high
> ----------------------------------------------------------
>
>                 Key: YUNIKORN-2322
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2322
>             Project: Apache YuniKorn
>          Issue Type: Task
>          Components: core - common
>            Reporter: Rainie Li
>            Assignee: Rainie Li
>            Priority: Major
>         Attachments: Screenshot 2024-01-10 at 4.31.52 PM.png, Screenshot 
> 2024-01-10 at 4.33.40 PM.png
>
>
> We are seeing the service get stuck when latency increases: even when the 
> cluster has resources, YuniKorn is unable to schedule apps. We have to 
> manually restart YuniKorn.
> We did profiling and found that most of the time is spent in 
> *tryReservedAllocate*. Attached are the profiling screenshot and service 
> latency data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
