[jira] [Closed] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue

2024-08-29 Thread Xi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen closed YUNIKORN-2731.
-
Fix Version/s: 1.5.2
   Resolution: Fixed

> YuniKorn stopped scheduling new containers with negative vcore in queue
> ---
>
> Key: YUNIKORN-2731
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 1.5.1
>Reporter: Xi Chen
>Priority: Major
> Fix For: 1.5.2
>
> Attachments: Applications stuck in Accepted status.png
>
>
> We have encountered this issue in one of our clusters every few days. We are
> running a version built from branch
> [https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/] at commit
> fb4e3f11345e6a9866dfaea97770c94b9421807b.
> Here is our queues.yaml configuration.
> {code:yaml}
> partitions:
>   - name: default
>     nodesortpolicy:
>       type: binpacking
>     preemption:
>       enabled: false
>     placementrules:
>       - name: tag
>         value: namespace
>         create: false
>     queues:
>       - name: root
>         submitacl: '*'
>         queues:
>           - name: c
>             resources:
>               guaranteed:
>                 memory: 13000Gi
>                 vcore: 3250
>               max:
>                 memory: 13000Gi
>                 vcore: 3250
>             properties:
>               application.sort.policy: fair
>           - name: e
>             resources:
>               guaranteed:
>                 memory: 2600Gi
>                 vcore: 650
>               max:
>                 memory: 2600Gi
>                 vcore: 650
>             properties:
>               application.sort.policy: fair
>           - name: m1
>             resources:
>               guaranteed:
>                 memory: 1000Gi
>                 vcore: 250
>               max:
>                 memory: 1000Gi
>                 vcore: 250
>             properties:
>               application.sort.policy: fair
>           - name: m2
>             resources:
>               guaranteed:
>                 memory: 62000Gi
>                 vcore: 15500
>               max:
>                 memory: 62000Gi
>                 vcore: 15500
>             properties:
>               application.sort.policy: fair
> {code}
> The issue is that at some point the scheduler stops starting new containers;
> eventually there are 0 containers running and many applications are stuck in
> Accepted status.
> !Applications stuck in Accepted status.png|width=1211,height=407!
> There are some logs that contain a negative vcore resource, and these logs
> are highly correlated with this issue in the timeline.
> {code:java}
> 2024-07-08T10:19:13.436Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76
>   c    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
> []},EventTime:2024-07-08 10:19:13.60205325 + UTC 
> m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
>  c example-job-1720433945-574-aa32179091daba13-driver 
> a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 
> 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' 
> (requested map[memory:2147483648 pods:1 vcore:500], available 
> map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
> hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
> vcore:-56150]),Type:Normal,DeprecatedSource:{ 
> },DeprecatedFirstTimestamp:0001-01-01 00:00:00 + 
> UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"
> 2024-07-08T10:19:05.391Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:05.601679       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433937-295

[jira] [Commented] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue

2024-08-29 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877986#comment-17877986
 ] 

Xi Chen commented on YUNIKORN-2731:
---

This issue is gone after upgrading to v1.5.2

> YuniKorn stopped scheduling new containers with negative vcore in queue
> ---
>
> Key: YUNIKORN-2731
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 1.5.1
>Reporter: Xi Chen
>Priority: Major
> Attachments: Applications stuck in Accepted status.png
>
>
> We have encountered this issue in one of our clusters every few days. We are
> running a version built from branch
> [https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/] at commit
> fb4e3f11345e6a9866dfaea97770c94b9421807b.
> Here is our queues.yaml configuration.
> {code:yaml}
> partitions:
>   - name: default
>     nodesortpolicy:
>       type: binpacking
>     preemption:
>       enabled: false
>     placementrules:
>       - name: tag
>         value: namespace
>         create: false
>     queues:
>       - name: root
>         submitacl: '*'
>         queues:
>           - name: c
>             resources:
>               guaranteed:
>                 memory: 13000Gi
>                 vcore: 3250
>               max:
>                 memory: 13000Gi
>                 vcore: 3250
>             properties:
>               application.sort.policy: fair
>           - name: e
>             resources:
>               guaranteed:
>                 memory: 2600Gi
>                 vcore: 650
>               max:
>                 memory: 2600Gi
>                 vcore: 650
>             properties:
>               application.sort.policy: fair
>           - name: m1
>             resources:
>               guaranteed:
>                 memory: 1000Gi
>                 vcore: 250
>               max:
>                 memory: 1000Gi
>                 vcore: 250
>             properties:
>               application.sort.policy: fair
>           - name: m2
>             resources:
>               guaranteed:
>                 memory: 62000Gi
>                 vcore: 15500
>               max:
>                 memory: 62000Gi
>                 vcore: 15500
>             properties:
>               application.sort.policy: fair
> {code}
> The issue is that at some point the scheduler stops starting new containers;
> eventually there are 0 containers running and many applications are stuck in
> Accepted status.
> !Applications stuck in Accepted status.png|width=1211,height=407!
> There are some logs that contain a negative vcore resource, and these logs
> are highly correlated with this issue in the timeline.
> {code:java}
> 2024-07-08T10:19:13.436Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76
>   c    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
> []},EventTime:2024-07-08 10:19:13.60205325 + UTC 
> m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
>  c example-job-1720433945-574-aa32179091daba13-driver 
> a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 
> 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' 
> (requested map[memory:2147483648 pods:1 vcore:500], available 
> map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
> hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
> vcore:-56150]),Type:Normal,DeprecatedSource:{ 
> },DeprecatedFirstTimestamp:0001-01-01 00:00:00 + 
> UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"
> 2024-07-08T10:19:05.391Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:05.601679       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{ex

[jira] [Commented] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue

2024-07-17 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1784#comment-1784
 ] 

Xi Chen commented on YUNIKORN-2731:
---

This is the root queue status from another incident, in which the allocated CPU
is greater than the max CPU:
{code:java}
Queue Info
Name: root
Status: Active
Allocated:
- Memory: 575.06 GiB
- CPU: 261
- pods: 69
Pending:
- Memory: 332.19 GiB
- CPU: 38
- pods: 76
Max:
- Memory: 1.19 TiB
- CPU: 205.55
- pods: 1.79k
- ephemeral-storage: 3.07 TB
- hugepages-1Gi: 0 B
- hugepages-2Mi: 0 B
- hugepages-32Mi: 0 B
- hugepages-64Ki: 0 B
Guaranteed:
- n/a
Absolute Used Capacity:
- Memory: 47%
- CPU: 126% {code}
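
A note on why this state stalls scheduling completely: the headroom a queue can still hand out is effectively max minus allocated per resource type, so once allocated drifts above max (the 126% CPU figure above, or the "available ... vcore:-56150" rejection events quoted in the issue description), every new request fails the fit check regardless of its size. Below is a minimal Go sketch of that arithmetic; it uses a simplified stand-in type rather than the actual yunikorn-core resources package, with values only roughly approximated from the queue status above.
{code:go}
package main

import "fmt"

// Resource is a simplified stand-in for a scheduler resource vector; the
// real scheduler tracks memory, vcore, pods, etc. in the same fashion.
type Resource map[string]int64

// headroomOf returns maximum - allocated per resource type; components go
// negative if the bookkeeping ever records more allocated than the maximum.
func headroomOf(maximum, allocated Resource) Resource {
	out := Resource{}
	for k, v := range maximum {
		out[k] = v - allocated[k]
	}
	return out
}

// fitsIn reports whether a request fits into the remaining headroom. With
// any negative component in the headroom, nothing fits any more.
func fitsIn(request, headroom Resource) bool {
	for k, need := range request {
		if need > headroom[k] {
			return false
		}
	}
	return true
}

func main() {
	// Rough numbers from the root queue status above (vcore in millicores,
	// memory in bytes); allocated CPU has drifted above the maximum.
	maxRes := Resource{"vcore": 205_550, "memory": 1_308_000_000_000}
	alloc := Resource{"vcore": 261_000, "memory": 617_500_000_000}
	headroom := headroomOf(maxRes, alloc)

	// A typical driver pod request from the logs: 0.5 CPU, 2 GiB memory.
	request := Resource{"vcore": 500, "memory": 2_147_483_648}

	fmt.Println("headroom:", headroom)                       // vcore is negative
	fmt.Println("request fits:", fitsIn(request, headroom))  // false -> "does not fit in queue"
}
{code}
Once the headroom is negative, the size of the request no longer matters, which matches the reports in this thread that nothing recovered until the scheduler was restarted (and, later, upgraded to 1.5.2).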
 
 

> YuniKorn stopped scheduling new containers with negative vcore in queue
> ---
>
> Key: YUNIKORN-2731
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 1.5.1
>Reporter: Xi Chen
>Priority: Major
> Attachments: Applications stuck in Accepted status.png
>
>
> We have encountered this issue in one of our clusters every few days. We are
> running a version built from branch
> [https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/] at commit
> fb4e3f11345e6a9866dfaea97770c94b9421807b.
> Here is our queues.yaml configuration.
> {code:yaml}
> partitions:
>   - name: default
>     nodesortpolicy:
>       type: binpacking
>     preemption:
>       enabled: false
>     placementrules:
>       - name: tag
>         value: namespace
>         create: false
>     queues:
>       - name: root
>         submitacl: '*'
>         queues:
>           - name: c
>             resources:
>               guaranteed:
>                 memory: 13000Gi
>                 vcore: 3250
>               max:
>                 memory: 13000Gi
>                 vcore: 3250
>             properties:
>               application.sort.policy: fair
>           - name: e
>             resources:
>               guaranteed:
>                 memory: 2600Gi
>                 vcore: 650
>               max:
>                 memory: 2600Gi
>                 vcore: 650
>             properties:
>               application.sort.policy: fair
>           - name: m1
>             resources:
>               guaranteed:
>                 memory: 1000Gi
>                 vcore: 250
>               max:
>                 memory: 1000Gi
>                 vcore: 250
>             properties:
>               application.sort.policy: fair
>           - name: m2
>             resources:
>               guaranteed:
>                 memory: 62000Gi
>                 vcore: 15500
>               max:
>                 memory: 62000Gi
>                 vcore: 15500
>             properties:
>               application.sort.policy: fair
> {code}
> The issue is that at some point the scheduler stops starting new containers;
> eventually there are 0 containers running and many applications are stuck in
> Accepted status.
> !Applications stuck in Accepted status.png|width=1211,height=407!
> There are some logs that contain a negative vcore resource, and these logs
> are highly correlated with this issue in the timeline.
> {code:java}
> 2024-07-08T10:19:13.436Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76
>   c    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
> []},EventTime:2024-07-08 10:19:13.60205325 + UTC 
> m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
>  c example-job-1720433945-574-aa32179091daba13-driver 
> a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 
> 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' 
> (requested map[memory:2147483648 pods:1 vcore:500], available 
> map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
> hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
> vcore:-56150]),Type:Normal,DeprecatedSource:{ 
> },DeprecatedFirstTimestamp:0001-01-01 00:00:00 + 
> UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"
> 2024-07-08T10:19:05.391Z    INFO    core.scheduler    
> scheduler/

[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-07-08 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864029#comment-17864029
 ] 

Xi Chen commented on YUNIKORN-2629:
---

[~wilfreds] Thanks for looking into this! I opened a new ticket: YUNIKORN-2731

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt, 
> yunikorn-scheduler-20240627.log, yunikorn_stuck_stack_20240708.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.
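
The lock inversion described above can be reduced to a small self-contained program. A sketch follows; the names (Context, addNode, the event loop) mirror the description, but this is an illustrative reduction, not the shim's actual code:
{code:go}
package main

import (
	"fmt"
	"sync"
	"time"
)

// Context mimics the shim context: a single RWMutex guards its state.
type Context struct {
	mu sync.RWMutex
}

// eventChan stands in for the dispatcher's event channel.
var eventChan = make(chan func(), 1)

// addNode takes the write lock and then waits for a NodeAccepted-style
// callback that only the event loop can deliver.
func (c *Context) addNode() {
	c.mu.Lock()
	defer c.mu.Unlock()

	var wg sync.WaitGroup
	wg.Add(1)
	eventChan <- func() { wg.Done() } // handler queued for the event loop
	wg.Wait()                         // never returns: the loop needs c.mu
}

// eventLoop mimics the dispatcher: handling an event needs the Context
// lock, just as the task handler's Context.getTask() does in the shim.
func (c *Context) eventLoop() {
	for handler := range eventChan {
		c.mu.RLock() // blocks while addNode holds the write lock
		handler()
		c.mu.RUnlock()
	}
}

func main() {
	ctx := &Context{}
	go ctx.eventLoop()

	done := make(chan struct{})
	go func() { ctx.addNode(); close(done) }()

	select {
	case <-done:
		fmt.Println("addNode finished") // never reached
	case <-time.After(2 * time.Second):
		fmt.Println("deadlocked: addNode holds the lock the event loop needs")
	}
}
{code}
addNode cannot return until the event loop runs its callback, and the event loop cannot run anything until addNode releases the lock, which is exactly the circular wait the description outlines.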






[jira] [Updated] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue

2024-07-08 Thread Xi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen updated YUNIKORN-2731:
--
Description: 
We have encountered this issue in one of our clusters every few days. We are
running a version built from branch
[https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/] at commit
fb4e3f11345e6a9866dfaea97770c94b9421807b.

Here is our queues.yaml configuration.
{code:yaml}
partitions:
  - name: default
    nodesortpolicy:
      type: binpacking
    preemption:
      enabled: false
    placementrules:
      - name: tag
        value: namespace
        create: false
    queues:
      - name: root
        submitacl: '*'
        queues:
          - name: c
            resources:
              guaranteed:
                memory: 13000Gi
                vcore: 3250
              max:
                memory: 13000Gi
                vcore: 3250
            properties:
              application.sort.policy: fair
          - name: e
            resources:
              guaranteed:
                memory: 2600Gi
                vcore: 650
              max:
                memory: 2600Gi
                vcore: 650
            properties:
              application.sort.policy: fair
          - name: m1
            resources:
              guaranteed:
                memory: 1000Gi
                vcore: 250
              max:
                memory: 1000Gi
                vcore: 250
            properties:
              application.sort.policy: fair
          - name: m2
            resources:
              guaranteed:
                memory: 62000Gi
                vcore: 15500
              max:
                memory: 62000Gi
                vcore: 15500
            properties:
              application.sort.policy: fair
{code}
The issue is that at some point the scheduler stops starting new containers;
eventually there are 0 containers running and many applications are stuck in
Accepted status.

!Applications stuck in Accepted status.png|width=1211,height=407!

There are some logs that contain a negative vcore resource, and these logs are
highly correlated with this issue in the timeline.
{code:java}
2024-07-08T10:19:13.436Z    INFO    core.scheduler    
scheduler/scheduler.go:101    Found outstanding requests that will trigger 
autoscaling    {"number of requests": 1, "total resources": 
"map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected event 
(will not retry!)" err="Event 
\"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is 
invalid: [action: Required value, reason: Required value]" 
event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76
  c    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
[]},EventTime:2024-07-08 10:19:13.60205325 + UTC 
m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
 c example-job-1720433945-574-aa32179091daba13-driver 
a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 
'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' 
(requested map[memory:2147483648 pods:1 vcore:500], available 
map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
vcore:-56150]),Type:Normal,DeprecatedSource:{ 
},DeprecatedFirstTimestamp:0001-01-01 00:00:00 + 
UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"

2024-07-08T10:19:05.391Z    INFO    core.scheduler    
scheduler/scheduler.go:101    Found outstanding requests that will trigger 
autoscaling    {"number of requests": 1, "total resources": 
"map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:05.601679       1 event_broadcaster.go:270] "Server rejected event 
(will not retry!)" err="Event 
\"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is 
invalid: [action: Required value, reason: Required value]" 
event="&Event{ObjectMeta:{example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4
  e    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
[]},EventTime:2024-07-08 10:19:05.599216316 + UTC 
m=+524770.654781585,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
 e example-job-1720433937-295-e7b2229091da99a7-driver 
14a40bac-5e89-4293-bcb7-936c544694a2 v1 201821666 },Related:nil,Note:Request 
'14a40bac-5e89-4293-bcb7-936c544694a2' does not fit in queue 'root.e' 
(requested map[memory:2147483648 pods:1 vcore:500], available 
map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
vcore:-56150]),Type:Normal,DeprecatedSource:{ 
},DeprecatedFirstTi

[jira] [Updated] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue

2024-07-08 Thread Xi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen updated YUNIKORN-2731:
--
Attachment: (was: Screenshot 2024-07-09 at 2.40.51 PM.png)

> YuniKorn stopped scheduling new containers with negative vcore in queue
> ---
>
> Key: YUNIKORN-2731
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 1.5.1
>Reporter: Xi Chen
>Priority: Major
> Attachments: Applications stuck in Accepted status.png
>
>
> We have encountered this issue in one of our clusters every few days. We are
> running a version built from branch
> [https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/] at commit
> fb4e3f11345e6a9866dfaea97770c94b9421807b.
> Here is our queues.yaml configuration.
>  
> {code:yaml}
> partitions:
>   - name: default
>     nodesortpolicy:
>       type: binpacking
>     preemption:
>       enabled: false
>     placementrules:
>       - name: tag
>         value: namespace
>         create: false
>     queues:
>       - name: root
>         submitacl: '*'
>         queues:
>           - name: c
>             resources:
>               guaranteed:
>                 memory: 13000Gi
>                 vcore: 3250
>               max:
>                 memory: 13000Gi
>                 vcore: 3250
>             properties:
>               application.sort.policy: fair
>           - name: e
>             resources:
>               guaranteed:
>                 memory: 2600Gi
>                 vcore: 650
>               max:
>                 memory: 2600Gi
>                 vcore: 650
>             properties:
>               application.sort.policy: fair
>           - name: m1
>             resources:
>               guaranteed:
>                 memory: 1000Gi
>                 vcore: 250
>               max:
>                 memory: 1000Gi
>                 vcore: 250
>             properties:
>               application.sort.policy: fair
>           - name: m2
>             resources:
>               guaranteed:
>                 memory: 62000Gi
>                 vcore: 15500
>               max:
>                 memory: 62000Gi
>                 vcore: 15500
>             properties:
>               application.sort.policy: fair
> {code}
> The issue is that at some point the scheduler stops starting new containers;
> eventually there are 0 containers running and many applications are stuck in
> Accepted status.
>  
> There are some logs that contain a negative vcore resource, and these logs
> are highly correlated with this issue in the timeline.
>  
> {code:java}
> 2024-07-08T10:19:13.436Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76
>   c    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
> []},EventTime:2024-07-08 10:19:13.60205325 + UTC 
> m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
>  c example-job-1720433945-574-aa32179091daba13-driver 
> a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 
> 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' 
> (requested map[memory:2147483648 pods:1 vcore:500], available 
> map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
> hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
> vcore:-56150]),Type:Normal,DeprecatedSource:{ 
> },DeprecatedFirstTimestamp:0001-01-01 00:00:00 + 
> UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"
> 2024-07-08T10:19:05.391Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:05.601679       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4
>   e    0 0001-01-01 00:0

[jira] [Updated] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue

2024-07-08 Thread Xi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen updated YUNIKORN-2731:
--
Attachment: Screenshot 2024-07-09 at 2.40.51 PM.png

> YuniKorn stopped scheduling new containers with negative vcore in queue
> ---
>
> Key: YUNIKORN-2731
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 1.5.1
>Reporter: Xi Chen
>Priority: Major
> Attachments: Applications stuck in Accepted status.png
>
>
> We have encountered this issue in one of our clusters every few days. We are
> running a version built from branch
> [https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/] at commit
> fb4e3f11345e6a9866dfaea97770c94b9421807b.
> Here is our queues.yaml configuration.
>  
> {code:yaml}
> partitions:
>   - name: default
>     nodesortpolicy:
>       type: binpacking
>     preemption:
>       enabled: false
>     placementrules:
>       - name: tag
>         value: namespace
>         create: false
>     queues:
>       - name: root
>         submitacl: '*'
>         queues:
>           - name: c
>             resources:
>               guaranteed:
>                 memory: 13000Gi
>                 vcore: 3250
>               max:
>                 memory: 13000Gi
>                 vcore: 3250
>             properties:
>               application.sort.policy: fair
>           - name: e
>             resources:
>               guaranteed:
>                 memory: 2600Gi
>                 vcore: 650
>               max:
>                 memory: 2600Gi
>                 vcore: 650
>             properties:
>               application.sort.policy: fair
>           - name: m1
>             resources:
>               guaranteed:
>                 memory: 1000Gi
>                 vcore: 250
>               max:
>                 memory: 1000Gi
>                 vcore: 250
>             properties:
>               application.sort.policy: fair
>           - name: m2
>             resources:
>               guaranteed:
>                 memory: 62000Gi
>                 vcore: 15500
>               max:
>                 memory: 62000Gi
>                 vcore: 15500
>             properties:
>               application.sort.policy: fair
> {code}
> The issue is that at some point the scheduler stops starting new containers;
> eventually there are 0 containers running and many applications are stuck in
> Accepted status.
>  
> There are some logs that contain a negative vcore resource, and these logs
> are highly correlated with this issue in the timeline.
>  
> {code:java}
> 2024-07-08T10:19:13.436Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76
>   c    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
> []},EventTime:2024-07-08 10:19:13.60205325 + UTC 
> m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
>  c example-job-1720433945-574-aa32179091daba13-driver 
> a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 
> 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' 
> (requested map[memory:2147483648 pods:1 vcore:500], available 
> map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
> hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
> vcore:-56150]),Type:Normal,DeprecatedSource:{ 
> },DeprecatedFirstTimestamp:0001-01-01 00:00:00 + 
> UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"
> 2024-07-08T10:19:05.391Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:05.601679       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4
>   e    0 0001-01-01 00:00:00 + 

[jira] [Updated] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue

2024-07-08 Thread Xi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen updated YUNIKORN-2731:
--
Attachment: Applications stuck in Accepted status.png

> YuniKorn stopped scheduling new containers with negative vcore in queue
> ---
>
> Key: YUNIKORN-2731
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Affects Versions: 1.5.1
>Reporter: Xi Chen
>Priority: Major
> Attachments: Applications stuck in Accepted status.png
>
>
> We have encountered this issue in one of our clusters every few days. We are
> running a version built from branch
> [https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/] at commit
> fb4e3f11345e6a9866dfaea97770c94b9421807b.
> Here is our queues.yaml configuration.
>  
> {code:yaml}
> partitions:
>   - name: default
>     nodesortpolicy:
>       type: binpacking
>     preemption:
>       enabled: false
>     placementrules:
>       - name: tag
>         value: namespace
>         create: false
>     queues:
>       - name: root
>         submitacl: '*'
>         queues:
>           - name: c
>             resources:
>               guaranteed:
>                 memory: 13000Gi
>                 vcore: 3250
>               max:
>                 memory: 13000Gi
>                 vcore: 3250
>             properties:
>               application.sort.policy: fair
>           - name: e
>             resources:
>               guaranteed:
>                 memory: 2600Gi
>                 vcore: 650
>               max:
>                 memory: 2600Gi
>                 vcore: 650
>             properties:
>               application.sort.policy: fair
>           - name: m1
>             resources:
>               guaranteed:
>                 memory: 1000Gi
>                 vcore: 250
>               max:
>                 memory: 1000Gi
>                 vcore: 250
>             properties:
>               application.sort.policy: fair
>           - name: m2
>             resources:
>               guaranteed:
>                 memory: 62000Gi
>                 vcore: 15500
>               max:
>                 memory: 62000Gi
>                 vcore: 15500
>             properties:
>               application.sort.policy: fair
> {code}
> The issue is that at some point the scheduler stops starting new containers;
> eventually there are 0 containers running and many applications are stuck in
> Accepted status.
>  
> There are some logs that contain a negative vcore resource, and these logs
> are highly correlated with this issue in the timeline.
>  
> {code:java}
> 2024-07-08T10:19:13.436Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76
>   c    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
> []},EventTime:2024-07-08 10:19:13.60205325 + UTC 
> m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
>  c example-job-1720433945-574-aa32179091daba13-driver 
> a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 
> 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' 
> (requested map[memory:2147483648 pods:1 vcore:500], available 
> map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
> hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
> vcore:-56150]),Type:Normal,DeprecatedSource:{ 
> },DeprecatedFirstTimestamp:0001-01-01 00:00:00 + 
> UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"
> 2024-07-08T10:19:05.391Z    INFO    core.scheduler    
> scheduler/scheduler.go:101    Found outstanding requests that will trigger 
> autoscaling    {"number of requests": 1, "total resources": 
> "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:05.601679       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event 
> \"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="&Event{ObjectMeta:{example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4
>   e    0 0001-01-01 00:00:00 +000

[jira] [Created] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue

2024-07-08 Thread Xi Chen (Jira)
Xi Chen created YUNIKORN-2731:
-

 Summary: YuniKorn stopped scheduling new containers with negative 
vcore in queue
 Key: YUNIKORN-2731
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Affects Versions: 1.5.1
Reporter: Xi Chen


We have encountered this issue in one of our clusters every few days. We are
running a version built from branch
[https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/] at commit
fb4e3f11345e6a9866dfaea97770c94b9421807b.

Here is our queues.yaml configuration.

 
{code:yaml}
partitions:
  - name: default
    nodesortpolicy:
      type: binpacking
    preemption:
      enabled: false
    placementrules:
      - name: tag
        value: namespace
        create: false
    queues:
      - name: root
        submitacl: '*'
        queues:
          - name: c
            resources:
              guaranteed:
                memory: 13000Gi
                vcore: 3250
              max:
                memory: 13000Gi
                vcore: 3250
            properties:
              application.sort.policy: fair
          - name: e
            resources:
              guaranteed:
                memory: 2600Gi
                vcore: 650
              max:
                memory: 2600Gi
                vcore: 650
            properties:
              application.sort.policy: fair
          - name: m1
            resources:
              guaranteed:
                memory: 1000Gi
                vcore: 250
              max:
                memory: 1000Gi
                vcore: 250
            properties:
              application.sort.policy: fair
          - name: m2
            resources:
              guaranteed:
                memory: 62000Gi
                vcore: 15500
              max:
                memory: 62000Gi
                vcore: 15500
            properties:
              application.sort.policy: fair
{code}
The issue is that at some point the scheduler stops starting new containers;
eventually there are 0 containers running and many applications are stuck in
Accepted status.

There are some logs that contain a negative vcore resource, and these logs are
highly correlated with this issue in the timeline.

 
{code:java}
2024-07-08T10:19:13.436Z    INFO    core.scheduler    
scheduler/scheduler.go:101    Found outstanding requests that will trigger 
autoscaling    {"number of requests": 1, "total resources": 
"map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected event 
(will not retry!)" err="Event 
\"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is 
invalid: [action: Required value, reason: Required value]" 
event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76
  c    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
[]},EventTime:2024-07-08 10:19:13.60205325 + UTC 
m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
 c example-job-1720433945-574-aa32179091daba13-driver 
a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 
'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' 
(requested map[memory:2147483648 pods:1 vcore:500], available 
map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 
hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 
vcore:-56150]),Type:Normal,DeprecatedSource:{ 
},DeprecatedFirstTimestamp:0001-01-01 00:00:00 + 
UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"

2024-07-08T10:19:05.391Z    INFO    core.scheduler    
scheduler/scheduler.go:101    Found outstanding requests that will trigger 
autoscaling    {"number of requests": 1, "total resources": 
"map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:05.601679       1 event_broadcaster.go:270] "Server rejected event 
(will not retry!)" err="Event 
\"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is 
invalid: [action: Required value, reason: Required value]" 
event="&Event{ObjectMeta:{example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4
  e    0 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
[]},EventTime:2024-07-08 10:19:05.599216316 + UTC 
m=+524770.654781585,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod
 e example-job-1720433937-295-e7b2229091da99a7-driver 
14a40bac-5e89-4293-bcb7-936c544694a2 v1 201821666 },Related:nil,Note:Request 
'14a40bac-5e89-4293-bcb7-936c544694a2' does not fit in queue 'root.e' 
(requested map[memory:2147483648 pods:1 vcore:500], available 
map[ephemeral-storage:2689798906768 

[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-07-08 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863801#comment-17863801
 ] 

Xi Chen commented on YUNIKORN-2629:
---

[~pbacsko] This is the latest stack trace we got from the YuniKorn
stopped-scheduling issue: [^yunikorn_stuck_stack_20240708.txt]

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt, 
> yunikorn-scheduler-20240627.log, yunikorn_stuck_stack_20240708.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-07-08 Thread Xi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen updated YUNIKORN-2629:
--
Attachment: yunikorn_stuck_stack_20240708.txt

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt, 
> yunikorn-scheduler-20240627.log, yunikorn_stuck_stack_20240708.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-07-03 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862915#comment-17862915
 ] 

Xi Chen commented on YUNIKORN-2629:
---

[~pbacsko] Thanks for the suggestion! We'll run it without preemption.
 

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt, 
> yunikorn-scheduler-20240627.log
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-07-03 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862843#comment-17862843
 ] 

Xi Chen commented on YUNIKORN-2629:
---

Good to know it's a false positive. But there was a real issue where the
scheduler stopped scheduling new containers until it was restarted. There were
some WARN messages before the scheduler started to hang and print the deadlock
messages:
{code:java}
2024-07-02T08:10:15.418Z    WARN    core.metrics    
metrics/metrics_collector.go:90    Could not calculate the 
totalContainersRunning.    {"allocatedContainers": 895708, 
"releasedContainers": 895720}
2024-07-02T08:10:15.420Z    WARN    core.scheduler.health    
scheduler/health_checker.go:178    Scheduler is not healthy    {"name": 
"Consistency of data", "description": "Check if a partition's allocated 
resource <= total resource of the partition", "message": "Partitions with 
inconsistent data: [\"[foo-cluster-name]default\"]"} {code}
[~pbacsko] Could you tell me what other information is needed for debugging? I
can collect it if it happens again.
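
For what it's worth, the first WARN already shows the accounting drift in miniature: totalContainersRunning is presumably derived as allocatedContainers minus releasedContainers, and with the numbers in that log line the difference is negative, so the metric cannot be reported. A tiny Go sketch of that check (illustrative only, not the metrics_collector.go implementation):
{code:go}
package main

import "fmt"

func main() {
	// Counts taken from the WARN log above.
	allocatedContainers := int64(895708)
	releasedContainers := int64(895720)

	// The running-container count should be allocated - released; a negative
	// result means the accounting has drifted and the metric is skipped.
	running := allocatedContainers - releasedContainers
	if running < 0 {
		fmt.Printf("inconsistent accounting: %d allocated vs %d released (diff %d)\n",
			allocatedContainers, releasedContainers, running)
		return
	}
	fmt.Println("totalContainersRunning:", running)
}
{code}
The second WARN is the same kind of inconsistency one level up: the health checker's invariant that a partition's allocated resource stays <= its total resource no longer holds.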

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt, 
> yunikorn-scheduler-20240627.log
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Comment Edited] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-07-02 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862661#comment-17862661
 ] 

Xi Chen edited comment on YUNIKORN-2629 at 7/3/24 6:00 AM:
---

[~pbacsko] Hi, I don't know if this is the right place, but we've found a
potential new deadlock issue. It happened twice during the past week in one of
our production environments. We are using different queues for different
namespaces, with the binpacking policy enabled. The deadlock detection output
is uploaded as [^yunikorn-scheduler-20240627.log]. We are running a scheduler
built from this branch, which includes the fix for this ticket (an early
version of branch-1.5):
[https://github.com/apache/yunikorn-k8shim/tree/fb4e3f11345e6a9866dfaea97770c94b9421807b]
 

Please let me know if this should be a new Jira ticket, thanks!


was (Author: jshmchenxi):
Hi, I don't know if this is the right place, but we've found a potential new
deadlock issue. It happened twice during the past week in one of our
production environments. We are using different queues for different
namespaces, with the binpacking policy enabled. The deadlock detection output
is uploaded as [^yunikorn-scheduler-20240627.log]. We are running a scheduler
built from this branch, which includes the fix for this ticket (an early
version of branch-1.5):
[https://github.com/apache/yunikorn-k8shim/tree/fb4e3f11345e6a9866dfaea97770c94b9421807b]
 

 

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt, 
> yunikorn-scheduler-20240627.log
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-07-02 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862661#comment-17862661
 ] 

Xi Chen commented on YUNIKORN-2629:
---

Hi, I don't know if this is the right place, but we've found a potential new
deadlock issue. It happened twice during the past week in one of our
production environments. We are using different queues for different
namespaces, with the binpacking policy enabled. The deadlock detection output
is uploaded as [^yunikorn-scheduler-20240627.log]. We are running a scheduler
built from this branch, which includes the fix for this ticket (an early
version of branch-1.5):
[https://github.com/apache/yunikorn-k8shim/tree/fb4e3f11345e6a9866dfaea97770c94b9421807b]
 

 

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt, 
> yunikorn-scheduler-20240627.log
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-07-02 Thread Xi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen updated YUNIKORN-2629:
--
Attachment: yunikorn-scheduler-20240627.log

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt, 
> yunikorn-scheduler-20240627.log
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>     dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>         nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>         if !ok {
>             return
>         }
>         [...] removed for clarity
>         wg.Done()
>     })
>     defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
> 
>     if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>         Nodes: nodesToRegister,
>         RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>     }); err != nil {
>         log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>         return nil, err
>     }
> 
>     // wait for all responses to accumulate
>     wg.Wait()  <--- shim gets stuck here
> {noformat}
> If tasks are being processed, the dispatcher will try to retrieve the 
> event handler, which is provided by Context:
> {noformat}
>     go func() {
>         for {
>             select {
>             case event := <-getDispatcher().eventChan:
>                 switch v := event.(type) {
>                 case events.TaskEvent:
>                     getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
>                 case events.ApplicationEvent:
>                     getEventHandler(EventTypeApp)(v)
>                 case events.SchedulerNodeEvent:
>                     getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock

2024-04-27 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841442#comment-17841442
 ] 

Xi Chen commented on YUNIKORN-2521:
---

No new stacks were found other than the preemption-related ones. Thanks for 
fixing this! [~pbacsko] [~ccondit] 

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Peter Bacsko
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn 1.5 in EKS (version 
> `v1.28.6-eks-508b6b3`). 
> In our case we run about 25-50 nodes and 200-400 pods; pods and nodes are 
> added and removed fairly frequently because we run ML workloads. 
> Attached is the goroutine dump. We were not able to get a state dump because 
> the endpoint kept timing out. 
> You can work around it by restarting the Yunikorn scheduler pod. Sometimes you 
> also have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked, so that they get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Comment Edited] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838821#comment-17838821
 ] 

Xi Chen edited comment on YUNIKORN-2521 at 4/19/24 4:35 AM:


[~ccondit], [~pbacsko] Thanks for looking into this! We are currently relying 
on the deadlock detection to restart YuniKorn. I can switch to a livenessProbe 
that hits `/ws/v1/fullstatedump` to check whether the scheduler is really hung.

*Update:*

I switched from the deadlock-detection exit to a livenessProbe and the scheduler 
keeps working without restarts. But the POTENTIAL DEADLOCK logs are still there, 
so this is likely a false positive.
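
For context on why a false positive is plausible: the POTENTIAL DEADLOCK messages come from the lock-wrapper library (assumed here to be {{github.com/sasha-s/go-deadlock}}, which the YuniKorn locking helpers can be built against), and that library reports any lock acquisition that waits longer than a configured timeout, whether or not the waiter eventually gets the lock. A minimal sketch of that behaviour, using only the go-deadlock public API rather than anything YuniKorn-specific:

{noformat}
package main

import (
    "log"
    "time"

    "github.com/sasha-s/go-deadlock"
)

func main() {
    // Lower the timeout so the demo trips quickly (the library default is much longer).
    deadlock.Opts.DeadlockTimeout = 500 * time.Millisecond
    // Log instead of the default behaviour (which terminates the process);
    // this mirrors the idea of letting a livenessProbe decide on restarts.
    deadlock.Opts.OnPotentialDeadlock = func() {
        log.Println("POTENTIAL DEADLOCK reported (may be a false positive)")
    }

    var mu deadlock.Mutex
    mu.Lock()

    go func() {
        // This goroutine waits longer than DeadlockTimeout for the lock, so the
        // detector fires, even though the holder releases the lock shortly after.
        mu.Lock()
        mu.Unlock()
    }()

    time.Sleep(2 * time.Second) // hold the lock past the timeout
    mu.Unlock()
    time.Sleep(200 * time.Millisecond) // give the waiter time to finish
}
{noformat}

The waiter does get the lock in the end, so the report here is a false positive in the same sense as above, which is why moving the restart decision from the detector's exit to a livenessProbe seems reasonable.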
 


was (Author: jshmchenxi):
[~ccondit], [~pbacsko] Thanks for looking into this! We are currently relying 
on the deadlock detection to restart YuniKorn. I can switch to a livenessProbe 
that hits `/ws/v1/fullstatedump` to check whether the scheduler is really hung. WDYT?

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn 1.5 in EKS (version 
> `v1.28.6-eks-508b6b3`). 
> In our case we run about 25-50 nodes and 200-400 pods; pods and nodes are 
> added and removed fairly frequently because we run ML workloads. 
> Attached is the goroutine dump. We were not able to get a state dump because 
> the endpoint kept timing out. 
> You can work around it by restarting the Yunikorn scheduler pod. Sometimes you 
> also have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked, so that they get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838821#comment-17838821
 ] 

Xi Chen commented on YUNIKORN-2521:
---

[~ccondit], [~pbacsko] Thanks for looking into this! We are currently relying 
on the deadlock detection to restart YuniKorn. I can switch to a livenessProbe 
that hits `/ws/v1/fullstatedump` to check whether the scheduler is really hung. WDYT?

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn 1.5 in EKS (version 
> `v1.28.6-eks-508b6b3`). 
> In our case we run about 25-50 nodes and 200-400 pods; pods and nodes are 
> added and removed fairly frequently because we run ML workloads. 
> Attached is the goroutine dump. We were not able to get a state dump because 
> the endpoint kept timing out. 
> You can work around it by restarting the Yunikorn scheduler pod. Sometimes you 
> also have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked, so that they get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Xi Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838707#comment-17838707
 ] 

Xi Chen commented on YUNIKORN-2521:
---

Hey [~pbacsko], I built the scheduler from branch-1.5 and deployed it in our 
environment, which runs Spark-on-K8s jobs across multiple namespaces. The 
scheduler was still logging POTENTIAL DEADLOCK. Please check the log 
[^deadlock_2024-04-18.log], thanks!
 

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn 1.5 in EKS (version 
> `v1.28.6-eks-508b6b3`). 
> In our case we run about 25-50 nodes and 200-400 pods; pods and nodes are 
> added and removed fairly frequently because we run ML workloads. 
> Attached is the goroutine dump. We were not able to get a state dump because 
> the endpoint kept timing out. 
> You can work around it by restarting the Yunikorn scheduler pod. Sometimes you 
> also have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked, so that they get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (YUNIKORN-2521) Scheduler deadlock

2024-04-18 Thread Xi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Chen updated YUNIKORN-2521:
--
Attachment: deadlock_2024-04-18.log

> Scheduler deadlock
> --
>
> Key: YUNIKORN-2521
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>Reporter: Noah Yoshida
>Assignee: Craig Condit
>Priority: Critical
> Attachments: 0001-YUNIKORN-2539-core.patch, 
> 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 
> 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 
> 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 
> 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, 
> goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, 
> goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, 
> goroutine-while-blocking.out, logs-potential-deadlock-2.txt, 
> logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, 
> profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, 
> running-logs.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn 1.5 in EKS (version 
> `v1.28.6-eks-508b6b3`). 
> In our case we run about 25-50 nodes and 200-400 pods; pods and nodes are 
> added and removed fairly frequently because we run ML workloads. 
> Attached is the goroutine dump. We were not able to get a state dump because 
> the endpoint kept timing out. 
> You can work around it by restarting the Yunikorn scheduler pod. Sometimes you 
> also have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked, so that they get picked up by the new scheduler pod. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
