[ https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838510#comment-17838510 ]
Peter Bacsko edited comment on YUNIKORN-2562 at 4/18/24 3:24 PM:
-----------------------------------------------------------------

Adding more comments: the usable queue capacity gradually degrades even though we have capacity available. For example, say my max allocation is 1.5 TB. Initially it works well, but after a few days (2+ days) utilisation comes down to about 60% of max capacity: in spite of available resources, the queue ends up limited to 55-65% of its max capacity. Upon restart, YuniKorn keeps crashing for a long time; eventually, after some time (15-20 minutes to 1 hour), it starts working again.

Adding a few logs here:
{noformat}
41d77", "placeholder": false, "pendingDelta": "map[memory:4723834880 pods:1 vcore:1000]"}
2024-04-18T06:49:34.944Z INFO core.scheduler.queue objects/queue.go:1408 allocation found on queue {"queueName": "root.xxx-spark", "appID": "application-spark-4rrgafat101r", "allocation": "applicationID=application-spark-4rrgafat101r, allocationID=e2d99aaf-6889-4d48-ac70-69e286c41d77-0, allocationKey=e2d99aaf-6889-4d48-ac70-69e286c41d77, Node=aks-obemuatnew-34197442-vmss000008, result=Replaced"}
2024-04-18T06:49:34.944Z INFO core.scheduler.partition scheduler/partition.go:867 scheduler replace placeholder processed {"appID": "application-spark-4rrgafat101r", "allocationKey": "e2d99aaf-6889-4d48-ac70-69e286c41d77", "allocationID": "e2d99aaf-6889-4d48-ac70-69e286c41d77-0", "placeholder released allocationID": "9f0e05fa-3d83-4dda-b993-b696af298420-0"}
2024-04-18T06:49:34.945Z INFO shim.cache.application cache/application.go:602 try to release pod from application {"appID": "application-spark-4rrgafat101r", "allocationID": "9f0e05fa-3d83-4dda-b993-b696af298420-0", "terminationType": "PLACEHOLDER_REPLACED"}
2024-04-18T06:49:35.017Z INFO core.scheduler scheduler/scheduler.go:101 Found outstanding requests that will trigger autoscaling {"number of requests": 1, "total resources": "map[memory:11811160064 pods:1 vcore:2000]"}
2024-04-18T06:49:35.077Z INFO shim.context cache/context.go:1123 task added {"appID": "application-spark-34b5vjdbgeb4", "taskID": "5ca32f14-df38-48b3-b420-e17f557dfa33", "taskState": "New"}
2024-04-18T06:49:35.139Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
2024-04-18T06:49:35.139Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-x2bwqi3mjr5q", "task": "7d21cb2a-3d50-45e7-8285-46d0428249e3", "taskAlias": "obem-spark/tg-application-spark-x2bwqi3mjr-spark-driver-llg4emobvz", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
2024-04-18T06:49:35.139Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-x2bwqi3mjr5q", "ask": "7d21cb2a-3d50-45e7-8285-46d0428249e3", "pendingDelta": "map[]"}
2024-04-18T06:49:35.139Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-x2bwqi3mjr5q", "allocationID": "7d21cb2a-3d50-45e7-8285-46d0428249e3"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]

goroutine 129 [running]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc007dcfc00, {0xc00630a390, 0x24})
    github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0xffffffffffffffff?, 0xc007f19b00)
    github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc004562ba0?, {0xc0098172a0, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
    github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc003a43f58?, 0xc003a43f10?)
    github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000428390)
    github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
    github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
{noformat}

(Author: JIRAUSER305116)

> Nil pointer in Application.ReplaceAllocation()
> ----------------------------------------------
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
>
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24})
>     github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0xffffffffffffffff?, 0xc009786700)
>     github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
>     github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?)
>     github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
>     github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
>     github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
> {noformat}