[ https://issues.apache.org/jira/browse/YUNIKORN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yongjun Zhang closed YUNIKORN-1996.
-----------------------------------
    Resolution: Invalid

> Change a log about queue update failure due to max capacity reached from Warn to Debug
> --------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-1996
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1996
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> We are seeing a similar issue to the one in YUNIKORN-1985:
> Tons of logs (62k of them in 3 seconds for the same request) because the max
> capacity of a queue has been reached:
> {code:java}
> log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
>     zap.Error(err))
> {code}
> in
> {code:java}
> func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation {
>     ...
>     // everything OK really allocate
>     alloc := NewAllocation(common.GetNewUUID(), node.NodeID, ask)
>     if node.AddAllocation(alloc) {
>         if err := sa.queue.IncAllocatedResource(alloc.GetAllocatedResource(), false); err != nil {
>             log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
>                 zap.Error(err))
>             // revert the node update
>             node.RemoveAllocation(alloc.GetUUID())
>             return nil
>         }{code}
> I strongly suspect it's simply because YuniKorn is trying a lot of nodes
> again and again, without being aware that the queue capacity has been
> exceeded, thus doing unnecessary work (because each try at that time is going
> to fail due to max capacity reached).
> This certainly would impact YuniKorn's performance.
> I guess we need to introduce categories of exceptions (MaxQueueCapReached,
> RequiredNodeUnavailable etc.) that require a delay before retry, and let the
> upper stack catch the exception, put the allocation into a queue or
> something similar, and wait for a certain period of time before retrying.
> But as a first step, we can just change the log to Debug level. Since the UI
> provides a way to check how much of its resources a given queue is using, and
> whether it has reached its max capacity, we don't lose much diagnostic
> capability after changing the log to Debug.
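
The error categories proposed in the description could be sketched roughly as follows. This is only an illustration under stated assumptions, not YuniKorn code: the sentinel errors reuse the category names from the issue, while `NeedsDelay`, `checkQueue`, and `RetryGate` are hypothetical names invented for this example, and capacities are plain integers rather than YuniKorn resource objects.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors named after the categories suggested in the
// issue. They are assumptions for this sketch, not part of YuniKorn.
var (
	ErrMaxQueueCapReached      = errors.New("max queue capacity reached")
	ErrRequiredNodeUnavailable = errors.New("required node unavailable")
)

// NeedsDelay reports whether err belongs to a category that should only be
// retried after a wait, instead of immediately on the next node.
func NeedsDelay(err error) bool {
	return errors.Is(err, ErrMaxQueueCapReached) ||
		errors.Is(err, ErrRequiredNodeUnavailable)
}

// checkQueue is a stand-in for a queue capacity check. It returns a wrapped
// ErrMaxQueueCapReached when the request would exceed the queue maximum.
func checkQueue(used, request, max int) error {
	if used+request > max {
		return fmt.Errorf("requested %d with %d of %d in use: %w",
			request, used, max, ErrMaxQueueCapReached)
	}
	return nil
}

// RetryGate remembers, per ask key, the earliest time a retry is allowed.
type RetryGate struct {
	delay     time.Duration
	notBefore map[string]time.Time
}

// NewRetryGate creates a gate that holds back delayed asks for delay.
func NewRetryGate(delay time.Duration) *RetryGate {
	return &RetryGate{delay: delay, notBefore: make(map[string]time.Time)}
}

// Allowed reports whether the ask identified by key may be tried at now.
func (g *RetryGate) Allowed(key string, now time.Time) bool {
	return !now.Before(g.notBefore[key])
}

// Record stores the outcome of an attempt made at now. Errors in a delay
// category push the next attempt out by the configured delay; any other
// outcome clears the hold on the ask.
func (g *RetryGate) Record(key string, err error, now time.Time) {
	if NeedsDelay(err) {
		g.notBefore[key] = now.Add(g.delay)
		return
	}
	delete(g.notBefore, key)
}
```

With something like this, the caller would check `gate.Allowed(askKey, time.Now())` before walking the node list and call `gate.Record(askKey, err, time.Now())` after an attempt, so that a queue at max capacity produces one failure per delay period rather than one per node. The sketch is not safe for concurrent use; a real scheduler would need locking or a single owning goroutine.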