[jira] [Closed] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xi Chen closed YUNIKORN-2731.
-----------------------------
    Fix Version/s: 1.5.2
       Resolution: Fixed

> YuniKorn stopped scheduling new containers with negative vcore in queue
> -----------------------------------------------------------------------
>
>                 Key: YUNIKORN-2731
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.5.1
>            Reporter: Xi Chen
>            Priority: Major
>             Fix For: 1.5.2
>
>         Attachments: Applications stuck in Accepted status.png
>
>
> We have encountered this issue in one of our clusters every few days. We are
> running a version built from branch
> [https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/], commit
> fb4e3f11345e6a9866dfaea97770c94b9421807b.
> Here is our queues.yaml configuration:
> {code:java}
> partitions:
>   - name: default
>     nodesortpolicy:
>       type: binpacking
>     preemption:
>       enabled: false
>     placementrules:
>       - name: tag
>         value: namespace
>         create: false
>     queues:
>       - name: root
>         submitacl: '*'
>         queues:
>           - name: c
>             resources:
>               guaranteed:
>                 memory: 13000Gi
>                 vcore: 3250
>               max:
>                 memory: 13000Gi
>                 vcore: 3250
>             properties:
>               application.sort.policy: fair
>           - name: e
>             resources:
>               guaranteed:
>                 memory: 2600Gi
>                 vcore: 650
>               max:
>                 memory: 2600Gi
>                 vcore: 650
>             properties:
>               application.sort.policy: fair
>           - name: m1
>             resources:
>               guaranteed:
>                 memory: 1000Gi
>                 vcore: 250
>               max:
>                 memory: 1000Gi
>                 vcore: 250
>             properties:
>               application.sort.policy: fair
>           - name: m2
>             resources:
>               guaranteed:
>                 memory: 62000Gi
>                 vcore: 15500
>               max:
>                 memory: 62000Gi
>                 vcore: 15500
>             properties:
>               application.sort.policy: fair
> {code}
> The issue is that at some point the scheduler stops starting new containers;
> eventually 0 containers are running and many applications are stuck in
> Accepted status.
> !Applications stuck in Accepted status.png|width=1211,height=407!
> There are logs that contain a negative vcore resource, and these logs are
> highly correlated with this issue in the timeline.
> {code:java}
> 2024-07-08T10:19:13.436Z INFO core.scheduler scheduler/scheduler.go:101 Found outstanding requests that will trigger autoscaling {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:13.604563 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76 c 0 0001-01-01 00:00:00 +0000 UTC map[] map[] [] [] []},EventTime:2024-07-08 10:19:13.60205325 +0000 UTC m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod c example-job-1720433945-574-aa32179091daba13-driver a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
> 2024-07-08T10:19:05.391Z INFO core.scheduler scheduler/scheduler.go:101 Found outstanding requests that will trigger autoscaling {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:05.601679 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433937-295
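The {{vcore:-56150}} in the available headroom above is the key symptom: once a queue's tracked allocation exceeds its configured max, every fit check fails, so nothing schedules no matter how much real capacity exists. Below is a minimal Go sketch of that failure mode — purely illustrative, not YuniKorn's actual accounting code, with numbers chosen only to echo the log:

{code:go}
package main

import "fmt"

// Resource is a simplified stand-in for a scheduler resource map
// (vcore in millicores).
type Resource map[string]int64

// sub subtracts used from max without clamping at zero. If releases are
// double-counted or allocations mis-tracked, the result goes negative.
func sub(max, used Resource) Resource {
	out := Resource{}
	for k, v := range max {
		out[k] = v - used[k]
	}
	return out
}

// fits reports whether a request fits into the available headroom.
func fits(request, available Resource) bool {
	for k, v := range request {
		if v > available[k] {
			return false
		}
	}
	return true
}

func main() {
	maxRes := Resource{"vcore": 3250000}  // root.c max: 3250 vcores
	usedRes := Resource{"vcore": 3306150} // over-accounted allocation
	available := sub(maxRes, usedRes)
	fmt.Println(available["vcore"])                      // -56150, as in the log
	fmt.Println(fits(Resource{"vcore": 500}, available)) // false, forever
}
{code}

Until the accounting drift is corrected (or the scheduler restarted), no request of any size fits, which matches the pile-up of applications in Accepted status.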
[jira] [Commented] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877986#comment-17877986 ]

Xi Chen commented on YUNIKORN-2731:
-----------------------------------

This issue is gone after upgrading to v1.5.2
[jira] [Commented] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1784#comment-1784 ]

Xi Chen commented on YUNIKORN-2731:
-----------------------------------

This is the root queue status in another incident, in which the allocated CPU is greater than the max CPU:
{code:java}
Queue Info
Name: root
Status: Active
Allocated:
- Memory: 575.06 GiB
- CPU: 261
- pods: 69
Pending:
- Memory: 332.19 GiB
- CPU: 38
- pods: 76
Max:
- Memory: 1.19 TiB
- CPU: 205.55
- pods: 1.79k
- ephemeral-storage: 3.07 TB
- hugepages-1Gi: 0 B
- hugepages-2Mi: 0 B
- hugepages-32Mi: 0 B
- hugepages-64Ki: 0 B
Guaranteed:
- n/a
Absolute Used Capacity:
- Memory: 47%
- CPU: 126%
{code}
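The reported capacity is internally consistent with the drift: 261 allocated vcores against a max of 205.55 is 261 / 205.55 ≈ 1.27, i.e. the queue believes it has handed out roughly 55 vcores more than its limit — the same order of overshoot that shows up as a negative available vcore in the rejection events. A quick check (illustrative only, assuming the UI truncates rather than rounds):

{code:go}
package main

import (
	"fmt"
	"math"
)

func main() {
	allocatedCPU, maxCPU := 261.0, 205.55
	pct := 100 * allocatedCPU / maxCPU
	fmt.Println(math.Floor(pct)) // 126, matching "Absolute Used Capacity: CPU: 126%"
}
{code}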
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864029#comment-17864029 ]

Xi Chen commented on YUNIKORN-2629:
-----------------------------------

[~wilfreds] Thanks for looking into this! I opened a new ticket: YUNIKORN-2731

> Adding a node can result in a deadlock
> --------------------------------------
>
>                 Key: YUNIKORN-2629
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 1.5.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.5.2
>
>         Attachments: updateNode_deadlock_trace.txt, yunikorn-scheduler-20240627.log, yunikorn_stuck_stack_20240708.txt
>
>
> Adding a new node after YuniKorn state initialization can result in a deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
> {noformat}
> dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>     nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>     if !ok {
>         return
>     }
>     [...] removed for clarity
>     wg.Done()
> })
> defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
>
> if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>     Nodes: nodesToRegister,
>     RmID:  schedulerconf.GetSchedulerConf().ClusterID,
> }); err != nil {
>     log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>     return nil, err
> }
>
> // wait for all responses to accumulate
> wg.Wait() <--- shim gets stuck here
> {noformat}
> If tasks are being processed, the dispatcher will try to retrieve the event handler, which is returned from Context:
> {noformat}
> go func() {
>     for {
>         select {
>         case event := <-getDispatcher().eventChan:
>             switch v := event.(type) {
>             case events.TaskEvent:
>                 getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask()
>             case events.ApplicationEvent:
>                 getEventHandler(EventTypeApp)(v)
>             case events.SchedulerNodeEvent:
>                 getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets stuck, so {{registerNodes()}} will never progress.
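The quoted pattern reproduces readily in isolation. Below is a self-contained sketch of the same shape — one dispatcher goroutine, one RWMutex, hypothetical names, not the actual shim code. Running it makes the Go runtime abort with {{fatal error: all goroutines are asleep - deadlock!}}:

{code:go}
package main

import (
	"fmt"
	"sync"
)

// ctx mimics the shim Context: one RWMutex guarding shared state.
type ctx struct{ lock sync.RWMutex }

func main() {
	c := &ctx{}
	events := make(chan string, 2)
	var wg sync.WaitGroup
	wg.Add(1)

	// Single dispatcher loop, like the shim's event-processing goroutine.
	go func() {
		for e := range events {
			switch e {
			case "TaskEvent":
				c.lock.RLock() // blocks: addNode below holds the write lock
				fmt.Println("task handled")
				c.lock.RUnlock()
			case "NodeAccepted":
				wg.Done() // never reached while TaskEvent is stuck ahead of it
			}
		}
	}()

	// addNode: take the write lock, then wait for NodeAccepted.
	c.lock.Lock()
	events <- "TaskEvent"    // a task event happens to be queued first
	events <- "NodeAccepted" // would unblock the wait, but sits behind it
	wg.Wait()                // deadlock: dispatcher is parked on RLock
	c.lock.Unlock()
}
{code}

The ticket's description points at the cycle to break: the {{NodeAccepted}} wait must not happen while the Context write lock is held, because the only goroutine that can deliver the event may first need that lock.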
[jira] [Updated] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xi Chen updated YUNIKORN-2731:
------------------------------
    Description: (full report as quoted above; the second rejected event, against queue 'root.e', shows the same negative available vcore:-56150)
[jira] [Updated] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xi Chen updated YUNIKORN-2731:
------------------------------
    Attachment: (was: Screenshot 2024-07-09 at 2.40.51 PM.png)
[jira] [Updated] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xi Chen updated YUNIKORN-2731:
------------------------------
    Attachment: Screenshot 2024-07-09 at 2.40.51 PM.png
[jira] [Updated] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xi Chen updated YUNIKORN-2731:
------------------------------
    Attachment: Applications stuck in Accepted status.png
[jira] [Created] (YUNIKORN-2731) YuniKorn stopped scheduling new containers with negative vcore in queue
Xi Chen created YUNIKORN-2731:
---------------------------------

             Summary: YuniKorn stopped scheduling new containers with negative vcore in queue
                 Key: YUNIKORN-2731
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
    Affects Versions: 1.5.1
            Reporter: Xi Chen
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863801#comment-17863801 ]

Xi Chen commented on YUNIKORN-2629:
-----------------------------------

[~pbacsko] This is the latest stack we got from the YuniKorn stopped scheduling issue: [^yunikorn_stuck_stack_20240708.txt]
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xi Chen updated YUNIKORN-2629:
------------------------------
    Attachment: yunikorn_stuck_stack_20240708.txt
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862915#comment-17862915 ]

Xi Chen commented on YUNIKORN-2629:
-----------------------------------

[~pbacsko] Thanks for the suggestion! We'll run it without preemption.
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862843#comment-17862843 ]

Xi Chen commented on YUNIKORN-2629:
-----------------------------------

Good to know it's a false positive. But there was a real issue: the scheduler stopped scheduling new containers until restart. There were some WARN messages before the scheduler started to hang and print the deadlock messages:
{code:java}
2024-07-02T08:10:15.418Z WARN core.metrics metrics/metrics_collector.go:90 Could not calculate the totalContainersRunning. {"allocatedContainers": 895708, "releasedContainers": 895720}
2024-07-02T08:10:15.420Z WARN core.scheduler.health scheduler/health_checker.go:178 Scheduler is not healthy {"name": "Consistency of data", "description": "Check if a partition's allocated resource <= total resource of the partition", "message": "Partitions with inconsistent data: [\"[foo-cluster-name]default\"]"}
{code}
[~pbacsko] Could you tell me what other information is needed for debugging? I can collect it if this happens again.
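The two warnings look like two views of the same accounting drift: 895,720 releases against 895,708 allocations means twelve more containers were released than were ever recorded as allocated, and at the partition level the same drift surfaces as allocated > total. A small sketch of the two checks — hypothetical, not the actual metrics_collector.go or health_checker.go code:

{code:go}
package main

import "fmt"

func main() {
	allocatedContainers, releasedContainers := int64(895708), int64(895720)

	// totalContainersRunning = allocated - released; a negative count is
	// impossible, so the metrics collector refuses to report it and warns.
	if running := allocatedContainers - releasedContainers; running < 0 {
		fmt.Printf("cannot compute totalContainersRunning: %d extra releases\n", -running)
	}

	// The health checker's "Consistency of data" check is the same idea at
	// the resource level: a partition is healthy only if allocated <= total.
	allocatedVcore, totalVcore := int64(261000), int64(205550) // millicores, echoing the queue dump above
	if allocatedVcore > totalVcore {
		fmt.Println("partition has inconsistent data: allocated > total")
	}
}
{code}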
[jira] [Comment Edited] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862661#comment-17862661 ]

Xi Chen edited comment on YUNIKORN-2629 at 7/3/24 6:00 AM:
-----------------------------------------------------------

[~pbacsko] Hi, I don't know if this is the right place, but we've found a potential new deadlock issue. It happened twice during the past week in one of our production environments. We are using different queues for different namespaces, with the binpacking policy enabled. The deadlock detection output is uploaded: [^yunikorn-scheduler-20240627.log]. We are running the scheduler built from this branch with the fix for this ticket (an early version of branch-1.5): [https://github.com/apache/yunikorn-k8shim/tree/fb4e3f11345e6a9866dfaea97770c94b9421807b]

Please let me know if this should be a new Jira ticket, thanks!

was (Author: jshmchenxi):
Hi, I don't know if this is the right place, but we've found a potential new deadlock issue. It happened twice during the past week in one of our production environments. We are using different queues for different namespaces, with the binpacking policy enabled. The deadlock detection output is uploaded: [^yunikorn-scheduler-20240627.log]. We are running the scheduler built from this branch with the fix for this ticket (an early version of branch-1.5): [https://github.com/apache/yunikorn-k8shim/tree/fb4e3f11345e6a9866dfaea97770c94b9421807b]
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862661#comment-17862661 ]

Xi Chen commented on YUNIKORN-2629:
-----------------------------------

Hi, I don't know if this is the right place, but we've found a potential new deadlock issue. It happened twice during the past week in one of our production environments. We are using different queues for different namespaces, with the binpacking policy enabled. The deadlock detection output is uploaded: [^yunikorn-scheduler-20240627.log]. We are running the scheduler built from this branch with the fix for this ticket (an early version of branch-1.5): [https://github.com/apache/yunikorn-k8shim/tree/fb4e3f11345e6a9866dfaea97770c94b9421807b]
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xi Chen updated YUNIKORN-2629:
------------------------------
    Attachment: yunikorn-scheduler-20240627.log
[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841442#comment-17841442 ]

Xi Chen commented on YUNIKORN-2521:
-----------------------------------

No new stacks were found other than the preemption. Thanks for fixing this! [~pbacsko] [~ccondit]

> Scheduler deadlock
> ------------------
>
>                 Key: YUNIKORN-2521
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
>             Project: Apache YuniKorn
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>         Environment: YuniKorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>            Reporter: Noah Yoshida
>            Assignee: Peter Bacsko
>            Priority: Critical
>             Fix For: 1.6.0, 1.5.1
>
>         Attachments: 0001-YUNIKORN-2539-core.patch, 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, goroutine-while-blocking.out, logs-potential-deadlock-2.txt, logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, running-logs.txt
>
>
> Discussion on YuniKorn Slack: [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, YuniKorn will deadlock and prevent any new pods from starting. All pods stay in Pending. There are no error logs inside of the YuniKorn scheduler indicating any issue.
> Additionally, the pods all have the correct annotations / labels from the admission service, so they are at least getting put into k8s correctly.
> The issue was seen intermittently on YuniKorn version 1.5 in EKS, using version `v1.28.6-eks-508b6b3`.
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes are added and removed pretty frequently as we do ML workloads.
> Attached is the goroutine dump. We were not able to get a statedump as the endpoint kept timing out.
> You can fix it by restarting the YuniKorn scheduler pod. Sometimes you also have to delete any "Pending" pods that got stuck while the scheduler was deadlocked as well, for them to get picked up by the new scheduler pod.
[jira] [Comment Edited] (YUNIKORN-2521) Scheduler deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838821#comment-17838821 ]

Xi Chen edited comment on YUNIKORN-2521 at 4/19/24 4:35 AM:
------------------------------------------------------------
[~ccondit], [~pbacsko] Thanks for looking into this! We are currently relying on the deadlock detection to restart YuniKorn. I can switch to a livenessProbe that uses `/ws/v1/fullstatedump` to see whether the scheduler really hangs.

*Update:* I switched from the deadlock-triggered exit to the livenessProbe, and the scheduler keeps working without restarting. The POTENTIAL DEADLOCK logs are still there, so they are likely false positives.

was (Author: jshmchenxi):
[~ccondit], [~pbacsko] Thanks for looking into this! We are currently relying on the deadlock detection to restart YuniKorn. I can switch to a livenessProbe that uses `/ws/v1/fullstatedump` to see whether the scheduler really hangs. WDYT?
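A sketch of the probe idea from this comment, purely illustrative: a tiny checker that exits non-zero when `/ws/v1/fullstatedump` fails to answer within a deadline, usable as an exec livenessProbe (a plain httpGet probe on the same path would also work). The port and timeout values are assumptions, and a full state dump can be expensive on large clusters, so the probe period and timeout should be generous.

{noformat}
package main

import (
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("http://localhost:9080/ws/v1/fullstatedump")
	if err != nil {
		os.Exit(1) // timed out or refused: treat the scheduler as not live
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		os.Exit(1)
	}
	// exit 0: the state dump answered, so the scheduler is not hung
}
{noformat}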
[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838821#comment-17838821 ]

Xi Chen commented on YUNIKORN-2521:
-----------------------------------
[~ccondit], [~pbacsko] Thanks for looking into this! We are currently relying on the deadlock detection to restart YuniKorn. I can switch to a livenessProbe that uses `/ws/v1/fullstatedump` to see whether the scheduler really hangs. WDYT?
[jira] [Commented] (YUNIKORN-2521) Scheduler deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838707#comment-17838707 ]

Xi Chen commented on YUNIKORN-2521:
-----------------------------------
Hey [~pbacsko], I built the scheduler from branch-1.5 and deployed it in our environment. Our setup includes multiple namespaces running Spark K8s jobs. The scheduler is still reporting POTENTIAL DEADLOCK. Please check the log [^deadlock_2024-04-18.log], thanks!
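The POTENTIAL DEADLOCK reports discussed in this thread come from a watchdog-style detector: it flags any lock that has been waited on longer than a threshold, so a lock that is merely slow produces a report even though it is eventually acquired, which is exactly the false-positive behavior observed above. A minimal sketch of that mechanism, assuming a detector along the lines of github.com/sasha-s/go-deadlock (whether YuniKorn's detector is exactly this library is an assumption here):

{noformat}
package main

import (
	"fmt"
	"time"

	deadlock "github.com/sasha-s/go-deadlock"
)

func main() {
	// Report after 100ms of waiting instead of the library default of 30s.
	deadlock.Opts.DeadlockTimeout = 100 * time.Millisecond
	// Log instead of exiting, mirroring the switch away from deadlock-exit
	// toward a livenessProbe described earlier in the thread.
	deadlock.Opts.OnPotentialDeadlock = func() {
		fmt.Println("POTENTIAL DEADLOCK reported (process keeps running)")
	}

	var mu deadlock.Mutex
	mu.Lock()
	go func() {
		time.Sleep(300 * time.Millisecond) // slow, but not a real deadlock
		mu.Unlock()
	}()

	mu.Lock() // triggers the report after 100ms, then succeeds anyway
	fmt.Println("lock acquired: the report was a false positive")
	mu.Unlock()
}
{noformat}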
[jira] [Updated] (YUNIKORN-2521) Scheduler deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xi Chen updated YUNIKORN-2521:
------------------------------
    Attachment: deadlock_2024-04-18.log