[jira] [Created] (YUNIKORN-2582) Shim: Use atomic.Int32 for atomic operations
Yu-Lin Chen created YUNIKORN-2582: - Summary: Shim: Use atomic.Int32 for atomic operations Key: YUNIKORN-2582 URL: https://issues.apache.org/jira/browse/YUNIKORN-2582 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Yu-Lin Chen Type atomic.Int32 was introduced in go 1.19, we should adopt it for better code clarity. Below variable type should be changed: 1. asyncDispatchCount in [dispatcher.go|https://github.com/apache/yunikorn-k8shim/blob/299816338caa1053e76e8f0d627321335d159cd2/pkg/dispatcher/dispatcher.go#L50] 2. registerCount/UpdateAllocationCount/UpdateApplicationCount/UpdateNodeCount in [schedulerapi_mock.go|https://github.com/apache/yunikorn-k8shim/blob/299816338caa1053e76e8f0d627321335d159cd2/pkg/common/test/schedulerapi_mock.go#L30-L33] PS: This Jira is based on the discussion under YUNIKORN-2576. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2581) Expose running placement rules in REST
Wilfred Spiegelenburg created YUNIKORN-2581: --- Summary: Expose running placement rules in REST Key: YUNIKORN-2581 URL: https://issues.apache.org/jira/browse/YUNIKORN-2581 Project: Apache YuniKorn Issue Type: New Feature Components: core - common Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Since introducing the use of placement rules always and the recovery rule the queue config does not correctly show the running rules. Also if a config update has been rejected, for any reason, the rules would not be correct Exposing the configured rules from the placement manager works around all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2575) Make logging for IsPodFitNode clear
[ https://issues.apache.org/jira/browse/YUNIKORN-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2575. - Fix Version/s: 1.6.0 Resolution: Fixed unique errors are returned for all failure cases which at DEBUG level will show exactly why the failure occurred. > Make logging for IsPodFitNode clear > --- > > Key: YUNIKORN-2575 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2575 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > The logging in {{IsPodFitNode()}} logs the same message for a missing pod and > node. We should log clearly which thing is missing: the node or the pod. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds
[ https://issues.apache.org/jira/browse/YUNIKORN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2580. - Resolution: Won't Fix This is used for the placeholder timeout and cannot be removed. > Remove executionTimeoutMilliSeconds > --- > > Key: YUNIKORN-2580 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2580 > Project: Apache YuniKorn > Issue Type: Improvement > Components: scheduler-interface >Reporter: Chia-Ping Tsai >Priority: Minor > > [https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211] > It is not used actually, and hence we should either remove it or add facility > for it. Personally, I'd like to remove it to simplify the interface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2574) totalPartitionResource should not be mutated with AddTo/SubFrom
[ https://issues.apache.org/jira/browse/YUNIKORN-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2574. Fix Version/s: 1.6.0 1.5.1 Resolution: Fixed Merged to master and branch-1.5. > totalPartitionResource should not be mutated with AddTo/SubFrom > --- > > Key: YUNIKORN-2574 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2574 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Affects Versions: 1.4.0, 1.5.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > There is a potential data race in {{PartitionContext}}: the field > {{totalPartitionResource}} is mutated in place. The problem is that the > method {{GetTotalPartitionResource()}} does not clone it. > {noformat} > func (pc *PartitionContext) GetTotalPartitionResource() *resources.Resource { > pc.RLock() > defer pc.RUnlock() > return pc.totalPartitionResource > } > {noformat} > In general, we should prefer the immutable approach for variables like this, > just like in {{{}objects.Queue{}}}: > {noformat} > func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported > bool) error { > // check this queue: failure stops checks if the allocation is not part > of a node addition > newAllocated := resources.Add(sq.allocatedResource, alloc)< > New object > [ ... removed ... ] > sq.Lock() > defer sq.Unlock() > // all OK update this queue > sq.allocatedResource = newAllocated > sq.updateAllocatedResourceMetrics() > return nil > } > // incPendingResource increments pending resource of this queue and its > parents. > func (sq *Queue) incPendingResource(delta *resources.Resource) { > // update the parent > if sq.parent != nil { > sq.parent.incPendingResource(delta) > } > // update this queue > sq.Lock() > defer sq.Unlock() > sq.pending = resources.Add(sq.pending, delta) < New object > sq.updatePendingResourceMetrics() > } > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2521) Scheduler deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2521. Fix Version/s: 1.6.0 1.5.1 Target Version: 1.6.0, 1.5.1 Resolution: Fixed This was delivered as part of YUNIKORN-2544. > Scheduler deadlock > -- > > Key: YUNIKORN-2521 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2521 > Project: Apache YuniKorn > Issue Type: Bug >Affects Versions: 1.5.0 > Environment: Yunikorn: 1.5 > AWS EKS: v1.28.6-eks-508b6b3 >Reporter: Noah Yoshida >Assignee: Craig Condit >Priority: Critical > Fix For: 1.6.0, 1.5.1 > > Attachments: 0001-YUNIKORN-2539-core.patch, > 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, > 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, > 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, > 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, > goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, > goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, > goroutine-while-blocking.out, logs-potential-deadlock-2.txt, > logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, > profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, > running-logs.txt > > > Discussion on Yunikorn slack: > [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179] > Occasionally, Yunikorn will deadlock and prevent any new pods from starting. > All pods stay in Pending. There are no error logs inside of the Yunikorn > scheduler indicating any issue. > Additionally, the pods all have the correct annotations / labels from the > admission service, so they are at least getting put into k8s correctly. > The issue was seen intermittently on Yunikorn version 1.5 in EKS, using > version `v1.28.6-eks-508b6b3`. > At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes > are added and removed pretty frequently as we do ML workloads. > Attached is the goroutine dump. We were not able to get a statedump as the > endpoint kept timing out. > You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also > have to delete any "Pending" pods that got stuck while the scheduler was > deadlocked as well, for them to get picked up by the new scheduler pod. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds or add facility for it
Chia-Ping Tsai created YUNIKORN-2580: Summary: Remove executionTimeoutMilliSeconds or add facility for it Key: YUNIKORN-2580 URL: https://issues.apache.org/jira/browse/YUNIKORN-2580 Project: Apache YuniKorn Issue Type: Improvement Reporter: Chia-Ping Tsai [https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211] It is not used actually, and hence we should either remove it or add facility for it. Personally, I'd like to remove it to simplify the interface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2579) SI: Remove maxAllocations field from AllocationAsk
[ https://issues.apache.org/jira/browse/YUNIKORN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2579. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > SI: Remove maxAllocations field from AllocationAsk > -- > > Key: YUNIKORN-2579 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2579 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: scheduler-interface >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > Now that maxAllocations != 1 is no longer supported, we need to remove the > maxAllocationsField from the AllocationAsk in the scheduler interface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2579) SI: Remove maxAllocations field from AllocationAsk
Craig Condit created YUNIKORN-2579: -- Summary: SI: Remove maxAllocations field from AllocationAsk Key: YUNIKORN-2579 URL: https://issues.apache.org/jira/browse/YUNIKORN-2579 Project: Apache YuniKorn Issue Type: Sub-task Components: scheduler-interface Reporter: Craig Condit Assignee: Craig Condit Now that maxAllocations != 1 is no longer supported, we need to remove the maxAllocationsField from the AllocationAsk in the scheduler interface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2578) Refactor SchedulerCache.GetPod() remove bool return
Wilfred Spiegelenburg created YUNIKORN-2578: --- Summary: Refactor SchedulerCache.GetPod() remove bool return Key: YUNIKORN-2578 URL: https://issues.apache.org/jira/browse/YUNIKORN-2578 Project: Apache YuniKorn Issue Type: Task Components: shim - kubernetes Reporter: Wilfred Spiegelenburg SchedulerCache {{GetPod()}} and {{GetPodNoLock()}} retrun two values: # *v1.Pod # bool The boolean value is redundant as it is false if the pod is not found and a nil is returned for the pod. The boolean is true if the pod has a value. Testing for a nil pod has the same result. We do not cache a nil pod in the cache for a pod UID -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2577) Remove named returns from IsPodFitNodeViaPreemption
Wilfred Spiegelenburg created YUNIKORN-2577: --- Summary: Remove named returns from IsPodFitNodeViaPreemption Key: YUNIKORN-2577 URL: https://issues.apache.org/jira/browse/YUNIKORN-2577 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg IsPodFitNodeViaPreemption has defined named returns but does not use them. They should be removed as the way they are used can cause issues that are hard to debug. As part of this change we need to further cleanup: * The variable {{ok}} also gets shadowed multiple times, not just from the named return declaration. * The if construct around {{GetPodNoLock()}} is not needed as it returns a nil for the pod if it returns false. Just adding the result for the pod always has the same effect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2400) Upgrade docusaurus to 3.x
[ https://issues.apache.org/jira/browse/YUNIKORN-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Lin Chen resolved YUNIKORN-2400. --- Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. Thanks for [~ryankert]'s contribution. > Upgrade docusaurus to 3.x > - > > Key: YUNIKORN-2400 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2400 > Project: Apache YuniKorn > Issue Type: Improvement > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Hsien-Cheng(Ryan) Huang >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > Docusaurus has released a major version update with updated dependencies. > There are a lot of breaking changes in v3 and we need to follow the upgrade > guide as things will most likely break: > [https://docusaurus.io/docs/migration/v3] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2576) Data Race: Flaky tests in dispatcher_test.go
Yu-Lin Chen created YUNIKORN-2576: - Summary: Data Race: Flaky tests in dispatcher_test.go Key: YUNIKORN-2576 URL: https://issues.apache.org/jira/browse/YUNIKORN-2576 Project: Apache YuniKorn Issue Type: Bug Reporter: Yu-Lin Chen Attachments: shim-race.txt How to reproduce: # In Shim, run 'go test ./pkg/... -race -count=10 > shim-race.txt' {code:java} WARNING: DATA RACE Write at 0x035315e0 by goroutine 88: github.com/apache/yunikorn-k8shim/pkg/dispatcher.initDispatcher() /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:73 +0x2c4 github.com/apache/yunikorn-k8shim/pkg/dispatcher.createDispatcher() /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:305 +0x2f runtime.Goexit() /usr/local/go/src/runtime/panic.go:626 +0x5d testing.(*T).FailNow() :1 +0x31 gotest.tools/v3/assert.Equal() /home/chenyulin0719/go/pkg/mod/gotest.tools/v3@v3.5.1/assert/assert.go:205 +0x1aa github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout() /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:244 +0x2ba testing.tRunner() /usr/local/go/src/testing/testing.go:1689 +0x21e testing.(*T).Run.gowrap1() /usr/local/go/src/testing/testing.go:1742 +0x44Previous read at 0x035315e0 by goroutine 90: github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.func1() /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:188 +0x2f5 github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.gowrap1() /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:197 +0x6eGoroutine 88 (running) created at: testing.(*T).Run() /usr/local/go/src/testing/testing.go:1742 +0x825 testing.runTests.func1() /usr/local/go/src/testing/testing.go:2161 +0x85 testing.tRunner() /usr/local/go/src/testing/testing.go:1689 +0x21e testing.runTests() /usr/local/go/src/testing/testing.go:2159 +0x8be testing.(*M).Run() /usr/local/go/src/testing/testing.go:2027 +0xf17 main.main() _testmain.go:55 +0x2bdGoroutine 90 (running) created at: github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch() /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:178 +0x391 github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).dispatch() /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:164 +0xbb github.com/apache/yunikorn-k8shim/pkg/dispatcher.Dispatch() /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:142 +0x71 github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout() /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:232 +0x244 testing.tRunner() /usr/local/go/src/testing/testing.go:1689 +0x21e testing.(*T).Run.gowrap1() /usr/local/go/src/testing/testing.go:1742 +0x44 =={code} Root Cause: * The [globla vairables|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L46-L51] in dispatcher.go is not protected when running unit tests. Each unit test will run initDispatcher() through [createDispatcher()|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatch_test.go#L305]. * Race occurs if any other unit tests read/write the global variables before or after initDispatcher(). ex: TestDispatchTimeout() [https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L188] Solution to be discussed: # Refactor dispatcher.go and encapsulates global variables to Dispatcher struct , change Dispatcher.Start(), Dispatcher.Stop() to type method # Implement Singleton in getDispatcher() and add a new function newDispatcher() # Create a new Dispatcher for each unit test The race issue only happens in unit test becasue the shared vairable was protected by once.Do(initDispatcher) in dispatcher.go : [https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L131] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org