[jira] [Created] (YUNIKORN-2582) Shim: Use atomic.Int32 for atomic operations

2024-04-23 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created YUNIKORN-2582:
-

 Summary: Shim: Use atomic.Int32 for atomic operations
 Key: YUNIKORN-2582
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2582
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Yu-Lin Chen


Type atomic.Int32 was introduced in go 1.19, we should adopt it for better code 
clarity.

Below variable type should be changed:
1. asyncDispatchCount in 
[dispatcher.go|https://github.com/apache/yunikorn-k8shim/blob/299816338caa1053e76e8f0d627321335d159cd2/pkg/dispatcher/dispatcher.go#L50]
2. registerCount/UpdateAllocationCount/UpdateApplicationCount/UpdateNodeCount 
in 
[schedulerapi_mock.go|https://github.com/apache/yunikorn-k8shim/blob/299816338caa1053e76e8f0d627321335d159cd2/pkg/common/test/schedulerapi_mock.go#L30-L33]

 

PS: This Jira is based on the discussion under YUNIKORN-2576.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2581) Expose running placement rules in REST

2024-04-23 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2581:
---

 Summary: Expose running placement rules in REST
 Key: YUNIKORN-2581
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2581
 Project: Apache YuniKorn
  Issue Type: New Feature
  Components: core - common
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


Since introducing the use of placement rules always and the recovery rule the 
queue config does not correctly show the running rules.

Also if a config update has been rejected, for any reason, the rules would not 
be correct

Exposing the configured rules from the placement manager works around all these 
issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2575) Make logging for IsPodFitNode clear

2024-04-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2575.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

unique errors are returned for all failure cases which at DEBUG level will show 
exactly why the failure occurred.

> Make logging for IsPodFitNode clear
> ---
>
> Key: YUNIKORN-2575
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2575
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> The logging in {{IsPodFitNode()}} logs the same message for a missing pod and 
> node. We should log clearly which thing is missing: the node or the pod.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds

2024-04-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2580.
-
Resolution: Won't Fix

This is used for the placeholder timeout and cannot be removed.

> Remove executionTimeoutMilliSeconds
> ---
>
> Key: YUNIKORN-2580
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2580
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: scheduler-interface
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> [https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211]
> It is not used actually, and hence we should either remove it or add facility 
> for it. Personally, I'd like to remove it to simplify the interface.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2574) totalPartitionResource should not be mutated with AddTo/SubFrom

2024-04-23 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2574.

Fix Version/s: 1.6.0
   1.5.1
   Resolution: Fixed

Merged to master and branch-1.5.

> totalPartitionResource should not be mutated with AddTo/SubFrom
> ---
>
> Key: YUNIKORN-2574
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2574
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> There is a potential data race in {{PartitionContext}}: the field 
> {{totalPartitionResource}} is mutated in place. The problem is that the 
> method {{GetTotalPartitionResource()}} does not clone it.
> {noformat}
> func (pc *PartitionContext) GetTotalPartitionResource() *resources.Resource {
>   pc.RLock()
>   defer pc.RUnlock()
>   return pc.totalPartitionResource
> }
> {noformat}
> In general, we should prefer the immutable approach for variables like this, 
> just like in {{{}objects.Queue{}}}:
> {noformat}
> func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported 
> bool) error {
>   // check this queue: failure stops checks if the allocation is not part 
> of a node addition
>   newAllocated := resources.Add(sq.allocatedResource, alloc)<  
> New object
> [ ... removed ... ]
>   sq.Lock()
>   defer sq.Unlock()
>   // all OK update this queue
>   sq.allocatedResource = newAllocated
>   sq.updateAllocatedResourceMetrics()
>   return nil
> }
> // incPendingResource increments pending resource of this queue and its 
> parents.
> func (sq *Queue) incPendingResource(delta *resources.Resource) {
>   // update the parent
>   if sq.parent != nil {
>   sq.parent.incPendingResource(delta)
>   }
>   // update this queue
>   sq.Lock()
>   defer sq.Unlock()
>   sq.pending = resources.Add(sq.pending, delta)     < New object
> sq.updatePendingResourceMetrics()
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2521) Scheduler deadlock

2024-04-23 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2521.

 Fix Version/s: 1.6.0
1.5.1
Target Version: 1.6.0, 1.5.1
Resolution: Fixed

This was delivered as part of YUNIKORN-2544.

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked as well, for them to get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2580) Remove executionTimeoutMilliSeconds or add facility for it

2024-04-23 Thread Chia-Ping Tsai (Jira)
Chia-Ping Tsai created YUNIKORN-2580:


 Summary: Remove executionTimeoutMilliSeconds or add facility for it
 Key: YUNIKORN-2580
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2580
 Project: Apache YuniKorn
  Issue Type: Improvement
Reporter: Chia-Ping Tsai


[https://github.com/apache/yunikorn-scheduler-interface/blob/b70081933c38018fd7f01c82635f5b186c4ef394/si.proto#L211]

It is not used actually, and hence we should either remove it or add facility 
for it. Personally, I'd like to remove it to simplify the interface.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2579) SI: Remove maxAllocations field from AllocationAsk

2024-04-23 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2579.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> SI: Remove maxAllocations field from AllocationAsk
> --
>
> Key: YUNIKORN-2579
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2579
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: scheduler-interface
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Now that maxAllocations != 1 is no longer supported, we need to remove the 
> maxAllocationsField from the AllocationAsk in the scheduler interface.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2579) SI: Remove maxAllocations field from AllocationAsk

2024-04-23 Thread Craig Condit (Jira)
Craig Condit created YUNIKORN-2579:
--

 Summary: SI: Remove maxAllocations field from AllocationAsk
 Key: YUNIKORN-2579
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2579
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: scheduler-interface
Reporter: Craig Condit
Assignee: Craig Condit


Now that maxAllocations != 1 is no longer supported, we need to remove the 
maxAllocationsField from the AllocationAsk in the scheduler interface.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2578) Refactor SchedulerCache.GetPod() remove bool return

2024-04-23 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2578:
---

 Summary: Refactor SchedulerCache.GetPod() remove bool return
 Key: YUNIKORN-2578
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2578
 Project: Apache YuniKorn
  Issue Type: Task
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


SchedulerCache {{GetPod()}} and {{GetPodNoLock()}} retrun two values:
# *v1.Pod
# bool

The boolean value is redundant as it is false if the pod is not found and a nil 
is returned for the pod. The boolean is true if the pod has a value. Testing 
for a nil pod has the same result.

We do not cache a nil pod in the cache for a pod UID



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2577) Remove named returns from IsPodFitNodeViaPreemption

2024-04-23 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2577:
---

 Summary: Remove named returns from IsPodFitNodeViaPreemption
 Key: YUNIKORN-2577
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2577
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


IsPodFitNodeViaPreemption has defined named returns but does not use them. They 
should be removed as the way they are used can cause issues that are hard to 
debug.

As part of this change we need to further cleanup:
* The variable {{ok}} also gets shadowed multiple times, not just from the 
named return declaration.
* The if construct around {{GetPodNoLock()}} is not needed as it returns a nil 
for the pod if it returns false. Just adding the result for the pod always has 
the same effect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2400) Upgrade docusaurus to 3.x

2024-04-23 Thread Yu-Lin Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Lin Chen resolved YUNIKORN-2400.
---
Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master. Thanks for [~ryankert]'s contribution.

> Upgrade docusaurus to 3.x
> -
>
> Key: YUNIKORN-2400
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2400
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: website
>Reporter: Wilfred Spiegelenburg
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Docusaurus has released a major version update with updated dependencies.
> There are a lot of breaking changes in v3 and we need to follow the upgrade 
> guide as things will most likely break:
> [https://docusaurus.io/docs/migration/v3]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2576) Data Race: Flaky tests in dispatcher_test.go

2024-04-23 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created YUNIKORN-2576:
-

 Summary: Data Race: Flaky tests in dispatcher_test.go
 Key: YUNIKORN-2576
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2576
 Project: Apache YuniKorn
  Issue Type: Bug
Reporter: Yu-Lin Chen
 Attachments: shim-race.txt

How to reproduce:
 # In Shim, run 'go test ./pkg/... -race -count=10  > shim-race.txt' 

 
{code:java}
WARNING: DATA RACE
Write at 0x035315e0 by goroutine 88:
  github.com/apache/yunikorn-k8shim/pkg/dispatcher.initDispatcher()
      
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:73 
+0x2c4
  github.com/apache/yunikorn-k8shim/pkg/dispatcher.createDispatcher()
      
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:305
 +0x2f
  runtime.Goexit()
      /usr/local/go/src/runtime/panic.go:626 +0x5d
  testing.(*T).FailNow()
      :1 +0x31
  gotest.tools/v3/assert.Equal()
      
/home/chenyulin0719/go/pkg/mod/gotest.tools/v3@v3.5.1/assert/assert.go:205 
+0x1aa
  github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout()
      
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:244
 +0x2ba
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44Previous read at 
0x035315e0 by goroutine 90:
  
github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.func1()
      
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:188 
+0x2f5
  
github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch.gowrap1()
      
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:197 
+0x6eGoroutine 88 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1742 +0x825
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:2161 +0x85
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.runTests()
      /usr/local/go/src/testing/testing.go:2159 +0x8be
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:2027 +0xf17
  main.main()
      _testmain.go:55 +0x2bdGoroutine 90 (running) created at:
  github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).asyncDispatch()
      
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:178 
+0x391
  github.com/apache/yunikorn-k8shim/pkg/dispatcher.(*Dispatcher).dispatch()
      
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:164 
+0xbb
  github.com/apache/yunikorn-k8shim/pkg/dispatcher.Dispatch()
      
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:142 
+0x71
  github.com/apache/yunikorn-k8shim/pkg/dispatcher.TestDispatchTimeout()
      
/home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/dispatcher/dispatch_test.go:232
 +0x244
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44
=={code}
Root Cause:
 * The [globla 
vairables|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L46-L51]
 in dispatcher.go is not protected when running unit tests. Each unit test will 
run initDispatcher() through 
[createDispatcher()|https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatch_test.go#L305].
 * Race occurs if any other unit tests read/write the global variables before 
or after initDispatcher(). ex: TestDispatchTimeout()   
[https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L188]

 

Solution to be discussed:
 # Refactor dispatcher.go and encapsulates global variables to Dispatcher struct
, change Dispatcher.Start(), Dispatcher.Stop() to type method
 # Implement Singleton in getDispatcher() and add a new function newDispatcher()
 # Create a new Dispatcher for each unit test

 

The race issue only happens in unit test becasue the shared vairable was 
protected by 
once.Do(initDispatcher) in dispatcher.go : 
[https://github.com/chenyulin0719/yunikorn-k8shim/blob/64b204a2fb3b83fde9d86ea58f5f0d1e42187472/pkg/dispatcher/dispatcher.go#L131]
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org