Peter Bacsko created YUNIKORN-1714:
--------------------------------------

             Summary: Fatal error: concurrent write/read when calling Queue.RemoveApplication()
                 Key: YUNIKORN-1714
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1714
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Peter Bacsko


Encountered this problem when doing some local testing with a lot of running 
applications:

{noformat}
fatal error: concurrent map read and map write

goroutine 8785 [running]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).RemoveApplication(0xc0002e0840, 0xc004a1cc40)
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/queue.go:697 +0x65
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).UnSetQueue(0xc004a1cc40)
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1493 +0x45
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).moveTerminatedApp(0xc0002aa600, {0xc00372e4e0, 0x16})
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/partition.go:1409 +0x73
created by github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).executeTerminatedCallback
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1831 +0xaa

...

goroutine 8782 [runnable]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).timeoutStateTimer.func1()
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:298
created by time.goFunc
        /snap/go/current/src/time/sleep.go:176 +0x32

goroutine 8623 [runnable]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).executeTerminatedCallback.func1()
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1831
runtime.goexit()
        /snap/go/current/src/runtime/asm_amd64.s:1598 +0x1
created by github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).executeTerminatedCallback
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1831 +0xaa

goroutine 8786 [runnable]:
go.uber.org/zap.(*stacktrace).Next(...)
        /home/bacskop/go/pkg/mod/go.uber.org/[email protected]/stacktrace.go:127
go.uber.org/zap.(*Logger).check(0xc0003bb650, 0x0, {0x1e6c20c, 0x2c})
        /home/bacskop/go/pkg/mod/go.uber.org/[email protected]/logger.go:372 +0x7e5
go.uber.org/zap.(*Logger).Info(0xc0002e0420?, {0x1e6c20c?, 0x1?}, {0xc005745680, 0x2, 0x2})
        /home/bacskop/go/pkg/mod/go.uber.org/[email protected]/logger.go:219 +0x3b
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).RemoveApplication(0xc0002e0840, 0xc004aa0380)
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/queue.go:742 +0xcc6
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).UnSetQueue(0xc004aa0380)
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1493 +0x45
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).moveTerminatedApp(0xc0002aa600, {0xc00372e498, 0x16})
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/partition.go:1409 +0x73
created by github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).executeTerminatedCallback
        /home/bacskop/repos/incubator-yunikorn-core/pkg/scheduler/objects/application.go:1831 +0xaa
{noformat}

There is an unprotected access to {{sq.applications[]}}: the code checks whether 
an application exists without holding the queue lock. This can fail because 
another goroutine can modify the map concurrently, which the Go runtime detects 
and treats as a fatal error.



