[ https://issues.apache.org/jira/browse/YUNIKORN-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xi Chen updated YUNIKORN-2731:
------------------------------
    Attachment: Screenshot 2024-07-09 at 2.40.51 PM.png

> YuniKorn stopped scheduling new containers with negative vcore in queue
> -----------------------------------------------------------------------
>
>                 Key: YUNIKORN-2731
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2731
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.5.1
>            Reporter: Xi Chen
>            Priority: Major
>         Attachments: Applications stuck in Accepted status.png
>
>
> We encounter this issue in one of our clusters every few days. We are
> running a version built from branch
> [https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/], commit
> fb4e3f11345e6a9866dfaea97770c94b9421807b.
> Here is our queues.yaml configuration.
>
> {code:java}
> partitions:
>   - name: default
>     nodesortpolicy:
>       type: binpacking
>     preemption:
>       enabled: false
>     placementrules:
>       - name: tag
>         value: namespace
>         create: false
>     queues:
>       - name: root
>         submitacl: '*'
>         queues:
>           - name: c
>             resources:
>               guaranteed:
>                 memory: 13000Gi
>                 vcore: 3250
>               max:
>                 memory: 13000Gi
>                 vcore: 3250
>             properties:
>               application.sort.policy: fair
>           - name: e
>             resources:
>               guaranteed:
>                 memory: 2600Gi
>                 vcore: 650
>               max:
>                 memory: 2600Gi
>                 vcore: 650
>             properties:
>               application.sort.policy: fair
>           - name: m1
>             resources:
>               guaranteed:
>                 memory: 1000Gi
>                 vcore: 250
>               max:
>                 memory: 1000Gi
>                 vcore: 250
>             properties:
>               application.sort.policy: fair
>           - name: m2
>             resources:
>               guaranteed:
>                 memory: 62000Gi
>                 vcore: 15500
>               max:
>                 memory: 62000Gi
>                 vcore: 15500
>             properties:
>               application.sort.policy: fair
> {code}
> At some point the scheduler stops starting new containers. Eventually
> there are 0 containers running and many applications stuck in Accepted
> status.
>
> Some logs contain a negative vcore resource, and these logs are highly
> correlated in time with this issue.
>
> {code:java}
> 2024-07-08T10:19:13.436Z INFO core.scheduler scheduler/scheduler.go:101 Found outstanding requests that will trigger autoscaling {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:13.604563 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76 c 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:19:13.60205325 +0000 UTC m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod c example-job-1720433945-574-aa32179091daba13-driver a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
> 2024-07-08T10:19:05.391Z INFO core.scheduler scheduler/scheduler.go:101 Found outstanding requests that will trigger autoscaling {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:19:05.601679 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4 e 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:19:05.599216316 +0000 UTC m=+524770.654781585,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod e example-job-1720433937-295-e7b2229091da99a7-driver 14a40bac-5e89-4293-bcb7-936c544694a2 v1 201821666 },Related:nil,Note:Request '14a40bac-5e89-4293-bcb7-936c544694a2' does not fit in queue 'root.e' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
> 2024-07-08T10:18:51.325Z INFO core.scheduler scheduler/scheduler.go:101 Found outstanding requests that will trigger autoscaling {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
> E0708 10:18:51.596390 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433923-763-378d629091da6500-driver.17e0358ba82f71a5\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433923-763-378d629091da6500-driver.17e0358ba82f71a5 m1 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:18:51.593930204 +0000 UTC m=+524756.649495472,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod m1 example-job-1720433923-763-378d629091da6500-driver f0c19c6a-6eb5-4e68-808d-389862c197cb v1 201821358 },Related:nil,Note:Request 'f0c19c6a-6eb5-4e68-808d-389862c197cb' does not fit in queue 'root.m1' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
> 2024-07-08T10:18:03.231Z INFO shim.context cache/context.go:1139 app request originating pod added {"appID": "spark-26e1b4f9c3124376aad12a9b63c8b711", "original task": "ffdf1356-4a7f-4559-9cbd-afa510f96cfe"}
> E0708 10:18:03.584031 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"another-example-job-1720433872-277-38b8b09091d9a492-driver.17e035807a6bb0df\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{another-example-job-1720433872-277-38b8b09091d9a492-driver.17e035807a6bb0df m2 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:18:03.581485338 +0000 UTC m=+524708.637050606,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod m2 another-example-job-1720433872-277-38b8b09091d9a492-driver 9b99dd53-cd1d-48b4-a8e3-c0c58f98a503 v1 201820328 },Related:nil,Note:Request '9b99dd53-cd1d-48b4-a8e3-c0c58f98a503' does not fit in queue 'root.m2' (requested map[memory:3758096384 pods:1 vcore:500], available map[ephemeral-storage:2490103866211 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:352023250073 pods:1635 vcore:-87850]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
> {code}
>
>
> There are also some "Scheduler is not healthy" warnings, but those logs
> were already present before the issue started.
>
> {code:java}
> 2024-07-08T10:19:24.990Z WARN core.scheduler.health scheduler/health_checker.go:178 Scheduler is not healthy {"name": "Consistency of data", "description": "Check if a partition's allocated resource <= total resource of the partition", "message": "Partitions with inconsistent data: [\"[foo-spark]default\"]"}
> {code}
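
To spot this condition in scheduler logs, a small script can pull the "does not fit in queue" notes out of the log text and report any resource whose available amount is negative. The following is only a sketch: the function names (parse_resource_map, find_negative_headroom, fits) are hypothetical, the regular expression is modelled on the note format shown in the logs above and expects unwrapped log lines, and fits() is a simplified illustration of a fit check (treating a missing resource as 0 is an assumption), not YuniKorn's implementation.

```python
import re

# Matches the note text seen in the rejected events above. Assumes each log
# entry is on a single, unwrapped line.
NOTE_RE = re.compile(
    r"does not fit in queue '(?P<queue>[^']+)'\s+"
    r"\(requested map\[(?P<requested>[^\]]*)\],\s+"
    r"available map\[(?P<available>[^\]]*)\]\)"
)


def parse_resource_map(text):
    """Parse a Go-style map body such as 'memory:1 pods:1 vcore:500'."""
    resources = {}
    for pair in text.split():
        # Split on the last colon so names like 'hugepages-1Gi' stay intact.
        name, _, value = pair.rpartition(":")
        resources[name] = int(value)
    return resources


def find_negative_headroom(log_text):
    """Return (queue, {resource: value}) for every note with a negative value."""
    findings = []
    for match in NOTE_RE.finditer(log_text):
        available = parse_resource_map(match.group("available"))
        negative = {name: v for name, v in available.items() if v < 0}
        if negative:
            findings.append((match.group("queue"), negative))
    return findings


def fits(requested, available):
    """Simplified fit check (an assumption, not YuniKorn's code): every
    requested amount must be <= the available amount, missing treated as 0."""
    return all(amount <= available.get(name, 0)
               for name, amount in requested.items())


# Excerpt of one note from the logs above.
SAMPLE = (
    "Note:Request 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in "
    "queue 'root.c' (requested map[memory:2147483648 pods:1 vcore:500], "
    "available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 "
    "hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 "
    "pods:1715 vcore:-56150])"
)

if __name__ == "__main__":
    print(find_negative_headroom(SAMPLE))
```

Under this simplified model, once the available vcore is negative, any request that asks for a positive vcore amount fails the check regardless of how much memory or how many pods remain, which is consistent with every application staying in Accepted status.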