Peter Bacsko created YUNIKORN-1714:
--------------------------------------
Summary: Fatal error: concurrent write/read when calling
Queue.RemoveApplication()
Key: YUNIKORN-1714
URL: https://issues.apache.org/jira/browse/YUNIKORN-1714
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Peter Bacsko
Encountered this problem when doing some local testing with a lot of running
applications:
{noformat}
fatal error: concurrent map read and map write
goroutine 8785 [running]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).RemoveApplication(0xc0002e0840,
0xc004a1cc40)
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/queue.go:697
+0x65
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).UnSetQueue(0xc004a1cc40)
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1493
+0x45
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).moveTerminatedApp(0xc0002aa600,
{0xc00372e4e0, 0x16})
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/partition.go:1409
+0x73
created by
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).executeTerminatedCallback
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1831
+0xaa
...
goroutine 8782 [runnable]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).timeoutStateTimer.func1()
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:298
created by time.goFunc
/snap/go/current/src/time/sleep.go:176 +0x32
goroutine 8623 [runnable]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).executeTerminatedCallback.func1()
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1831
runtime.goexit()
/snap/go/current/src/runtime/asm_amd64.s:1598 +0x1
created by
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).executeTerminatedCallback
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1831
+0xaa
goroutine 8786 [runnable]:
go.uber.org/zap.(*stacktrace).Next(...)
/home/bacskop/go/pkg/mod/go.uber.org/[email protected]/stacktrace.go:127
go.uber.org/zap.(*Logger).check(0xc0003bb650, 0x0, {0x1e6c20c, 0x2c})
/home/bacskop/go/pkg/mod/go.uber.org/[email protected]/logger.go:372 +0x7e5
go.uber.org/zap.(*Logger).Info(0xc0002e0420?, {0x1e6c20c?, 0x1?},
{0xc005745680, 0x2, 0x2})
/home/bacskop/go/pkg/mod/go.uber.org/[email protected]/logger.go:219 +0x3b
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).RemoveApplication(0xc0002e0840,
0xc004aa0380)
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/queue.go:742
+0xcc6
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).UnSetQueue(0xc004aa0380)
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1493
+0x45
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).moveTerminatedApp(0xc0002aa600,
{0xc00372e498, 0x16})
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/partition.go:1409
+0x73
created by
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).executeTerminatedCallback
/home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1831
+0xaa
{noformat}
There is an unprotected access to {{sq.applications[]}}: the code checks whether
an application exists without locking. This can fail because another goroutine
can modify the map at the same time, which the Go runtime detects and treats as
a fatal error.
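
For illustration only, here is a minimal sketch of the pattern that avoids this class of crash: the existence check and the delete both happen while holding the same lock. This is not the actual YuniKorn code or the eventual fix. The {{Queue}} and {{Application}} types below are simplified, hypothetical stand-ins, and {{NewQueue}}, {{AddApplication}} and {{appCount}} are names assumed for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Application is a hypothetical stand-in for the scheduler's application object.
type Application struct {
	ApplicationID string
}

// Queue is a simplified, hypothetical queue that only keeps the
// applications map and the lock that guards it.
type Queue struct {
	sync.RWMutex
	applications map[string]*Application
}

// NewQueue returns an empty queue with an initialised map.
func NewQueue() *Queue {
	return &Queue{applications: make(map[string]*Application)}
}

// AddApplication stores the application under the write lock.
func (sq *Queue) AddApplication(app *Application) {
	sq.Lock()
	defer sq.Unlock()
	sq.applications[app.ApplicationID] = app
}

// RemoveApplication performs the existence check and the delete under the
// same write lock, so no other goroutine can touch the map in between.
// It reports whether the application was present.
func (sq *Queue) RemoveApplication(app *Application) bool {
	sq.Lock()
	defer sq.Unlock()
	if _, ok := sq.applications[app.ApplicationID]; !ok {
		return false
	}
	delete(sq.applications, app.ApplicationID)
	return true
}

// appCount reads the map size under the read lock.
func (sq *Queue) appCount() int {
	sq.RLock()
	defer sq.RUnlock()
	return len(sq.applications)
}

func main() {
	q := NewQueue()
	apps := make([]*Application, 0, 100)
	for i := 0; i < 100; i++ {
		app := &Application{ApplicationID: fmt.Sprintf("app-%d", i)}
		apps = append(apps, app)
		q.AddApplication(app)
	}

	// Remove every application from its own goroutine, similar to the
	// terminated callbacks seen in the stack trace.
	var wg sync.WaitGroup
	for _, app := range apps {
		wg.Add(1)
		go func(a *Application) {
			defer wg.Done()
			q.RemoveApplication(a)
		}(app)
	}
	wg.Wait()
	fmt.Println("remaining applications:", q.appCount())
}
```

If the existence check were moved in front of {{sq.Lock()}}, running the same program with {{go run -race}} would be expected to report the race, and under enough load the runtime could abort with the fatal error shown above.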