[ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838510#comment-17838510
 ] 

Peter Bacsko edited comment on YUNIKORN-2562 at 4/18/24 3:24 PM:
-----------------------------------------------------------------

Adding more comments: queue capacity actually degrades gradually even though 
we have capacity available. For example, let's say my max allocation is 1.5 TB. 
Initially it works well, but after a few days (2+ days) utilisation comes down 
to about 60% of max capacity: in spite of available resources, the queue gets 
limited to 55-65% of its max capacity at most. Upon restart, YuniKorn keeps 
crashing for a long time, and eventually (after 15-20 minutes to 1 hour) it 
starts working again. Adding a few logs here:

 
{noformat}
41d77", "placeholder": false, "pendingDelta": "map[memory:4723834880 pods:1 
vcore:1000]"}

2024-04-18T06:49:34.944Z INFO core.scheduler.queue objects/queue.go:1408 
allocation found on queue {"queueName": "root.xxx-spark", "appID": 
"application-spark-4rrgafat101r", "allocation": 
"applicationID=application-spark-4rrgafat101r, 
allocationID=e2d99aaf-6889-4d48-ac70-69e286c41d77-0, 
allocationKey=e2d99aaf-6889-4d48-ac70-69e286c41d77, 
Node=aks-obemuatnew-34197442-vmss000008, result=Replaced"}
2024-04-18T06:49:34.944Z INFO core.scheduler.partition 
scheduler/partition.go:867 scheduler replace placeholder processed {"appID": 
"application-spark-4rrgafat101r", "allocationKey": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77", "allocationID": 
"e2d99aaf-6889-4d48-ac70-69e286c41d77-0", "placeholder released allocationID": 
"9f0e05fa-3d83-4dda-b993-b696af298420-0"}
2024-04-18T06:49:34.945Z INFO shim.cache.application cache/application.go:602 
try to release pod from application {"appID": 
"application-spark-4rrgafat101r", "allocationID": 
"9f0e05fa-3d83-4dda-b993-b696af298420-0", "terminationType": 
"PLACEHOLDER_REPLACED"}
2024-04-18T06:49:35.017Z INFO core.scheduler scheduler/scheduler.go:101 Found 
outstanding requests that will trigger autoscaling {"number of requests": 1, 
"total resources": "map[memory:11811160064 pods:1 vcore:2000]"}
2024-04-18T06:49:35.077Z INFO shim.context cache/context.go:1123 task added 
{"appID": "application-spark-34b5vjdbgeb4", "taskID": 
"5ca32f14-df38-48b3-b420-e17f557dfa33", "taskState": "New"}
2024-04-18T06:49:35.139Z INFO shim.cache.task cache/task.go:542 releasing 
allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1}
2024-04-18T06:49:35.139Z INFO shim.fsm cache/task_state.go:380 Task state 
transition {"app": "application-spark-x2bwqi3mjr5q", "task": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3", "taskAlias": 
"obem-spark/tg-application-spark-x2bwqi3mjr-spark-driver-llg4emobvz", "source": 
"Bound", "destination": "Completed", "event": "CompleteTask"}
2024-04-18T06:49:35.139Z INFO core.scheduler.application 
objects/application.go:616 ask removed successfully from application {"appID": 
"application-spark-x2bwqi3mjr5q", "ask": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3", "pendingDelta": "map[]"}
2024-04-18T06:49:35.139Z INFO core.scheduler.partition 
scheduler/partition.go:1281 replacing placeholder allocation {"appID": 
"application-spark-x2bwqi3mjr5q", "allocationID": 
"7d21cb2a-3d50-45e7-8285-46d0428249e3"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]

goroutine 129 [running]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc007dcfc00, {0xc00630a390, 0x24})
      github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0xffffffffffffffff?, 0xc007f19b00)
      github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc004562ba0?, {0xc0098172a0, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
      github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc003a43f58?, 0xc003a43f10?)
      github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000428390)
      github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5
created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
      github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c
{noformat}
 

 


was (Author: JIRAUSER305116): the same comment text and log excerpt as above, 
without the {noformat} markers.

> Nil pointer in Application.ReplaceAllocation()
> ----------------------------------------------
>
>                 Key: YUNIKORN-2562
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z      INFO    shim.cache.task cache/task.go:542       
> releasing allocations   {"numOfAsksToRelease": 1, 
> "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z      INFO    shim.fsm        cache/task_state.go:380 
> Task state transition   {"app": "application-spark-abrdrsmo8no2", "task": 
> "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": 
> "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", 
> "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z      INFO    core.scheduler.application      
> objects/application.go:616      ask removed successfully from application     
>   {"appID": "application-spark-abrdrsmo8no2", "ask": 
> "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z      INFO    core.scheduler.partition        
> scheduler/partition.go:1281     replacing placeholder allocation        
> {"appID": "application-spark-abrdrsmo8no2", "allocationID": 
> "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600,
>  {0xc007710cf0, 0x24})
>       
> github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745
>  +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0xffffffffffffffff?,
>  0xc009786700)
>       
> github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 
> +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?,
>  {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
>       github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 
> +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?,
>  0xc0071a3f10?)
>       github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 
> +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
>       github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 
> +0x1c5
> created by 
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in 
> goroutine 1
>       github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 
> +0x9c
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
