[jira] [Created] (YUNIKORN-2559) DATA RACE: EventStore.Store() and Context.PublishEvents()
Yu-Lin Chen created YUNIKORN-2559:
-------------------------------------

             Summary: DATA RACE: EventStore.Store() and Context.PublishEvents()
                 Key: YUNIKORN-2559
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2559
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - common
            Reporter: Yu-Lin Chen
            Assignee: Yu-Lin Chen
         Attachments: shim-racing-log.txt

How to reproduce:
# In shim, update the core version to the latest version (v0.0.0-20240415111844-72540e2b277f)
# go mod tidy
# Run 'make test > shim-racing-log.txt'

{code:java}
WARNING: DATA RACE
Write at 0x00c003882008 by goroutine 59:
  github.com/apache/yunikorn-core/pkg/events.(*EventStore).Store()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_store.go:59 +0x1aa
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartServiceWithPublisher.func2()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:194 +0x167

Previous read at 0x00c003882008 by goroutine 60:
  github.com/apache/yunikorn-k8shim/pkg/cache.(*Context).PublishEvents()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/cache/context.go:1176 +0x97
  github.com/apache/yunikorn-k8shim/pkg/cache.(*AsyncRMCallback).SendEvent()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/cache/scheduler_callback.go:235 +0xec
  github.com/apache/yunikorn-core/pkg/events.(*EventPublisher).StartService.func1()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_publisher.go:60 +0x27d

Goroutine 59 (running) created at:
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartServiceWithPublisher()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:183 +0x287
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartService()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:166 +0x2b
  github.com/apache/yunikorn-core/pkg/entrypoint.startAllServicesWithParameters()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/entrypoint/entrypoint.go:80 +0x9b
  github.com/apache/yunikorn-core/pkg/entrypoint.StartAllServices()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/entrypoint/entrypoint.go:43 +0x59
  github.com/apache/yunikorn-k8shim/pkg/shim.(*MockScheduler).init()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/shim/scheduler_mock_test.go:63 +0xad
  github.com/apache/yunikorn-k8shim/pkg/shim.TestApplicationScheduling()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/shim/scheduler_test.go:60 +0x8c
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44

Goroutine 60 (running) created at:
  github.com/apache/yunikorn-core/pkg/events.(*EventPublisher).StartService()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_publisher.go:50 +0xc4
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartServiceWithPublisher()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:203 +0x2b8
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartService()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:166 +0x2b
  github.com/apache/yunikorn-core/pkg/entrypoint.startAllServicesWithParameters()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/entrypoint/entrypoint.go:80 +0x9b
  github.com/apache/yunikorn-core/pkg/entrypoint.StartAllServices()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/entrypoint/entrypoint.go:43 +0x59
  github.com/apache/yunikorn-k8shim/pkg/shim.(*MockScheduler).init()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/shim/scheduler_mock_test.go:63 +0xad
  github.com/apache/yunikorn-k8shim/pkg/shim.TestApplicationScheduling()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/shim/scheduler_test.go:60 +0x8c
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44
==
{code}

{*}Root cause{*}: EventStore.events is read and written by two different goroutines.
* Goroutine 1: EventPublisher.StartService() call
[jira] [Created] (YUNIKORN-2558) Remove redundant conditional
Hsien-Cheng(Ryan) Huang created YUNIKORN-2558:
-------------------------------------------------

             Summary: Remove redundant conditional
                 Key: YUNIKORN-2558
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2558
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Hsien-Cheng(Ryan) Huang
             Fix For: 1.5.0

The script skips {{make tools}} when the tools folder exists. That is odd, because we never check that all tools are actually present in the folder. The script should call {{make tools}} unconditionally and let {{make tools}} do the per-tool check and installation.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
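The pattern the issue describes can be sketched as follows. This is a hypothetical illustration: `install_tools` and the tool names stand in for the real {{make tools}} target and its contents, which may differ.

```shell
#!/bin/sh
# install_tools plays the role of `make tools`: it is idempotent,
# checking each tool individually and installing only what is missing,
# so calling it unconditionally is cheap when everything is in place.
install_tools() {
    for tool in golangci-lint license-eye; do   # assumed tool names
        if [ ! -e "tools/$tool" ]; then
            mkdir -p tools
            : > "tools/$tool"   # placeholder for the real download/install
            echo "installed $tool"
        fi
    done
}

# Before (problematic): skip entirely whenever the folder exists,
# even if only some of the tools are present inside it:
# [ -d tools ] || install_tools

# After: always call; the target itself does the per-tool check.
install_tools
install_tools   # second call is a no-op: all tools already present
```

The first call prints one "installed …" line per missing tool; the second call prints nothing, which is why the outer directory check buys nothing.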
[jira] [Resolved] (YUNIKORN-2552) Recursive locking when sending remove queue event
[ https://issues.apache.org/jira/browse/YUNIKORN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2552.
------------------------------------
    Fix Version/s: 1.6.0
                   1.5.1
       Resolution: Fixed

> Recursive locking when sending remove queue event
> -------------------------------------------------
>
>                 Key: YUNIKORN-2552
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2552
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.6.0, 1.5.1
>
> When sending a queue event from {{queueEvents}}, we acquire the read lock again.
> {noformat}
> objects.(*Queue).IsManaged { sq.RLock() } <
> objects.(*Queue).IsManaged { func (sq *Queue) IsManaged() bool { }
> objects.(*queueEvents).sendRemoveQueueEvent { } }
> objects.(*Queue).RemoveQueue { sq.queueEvents.sendRemoveQueueEvent() }
> scheduler.(*partitionManager).cleanQueues { // all OK update the queue hierarchy and partition }
> scheduler.(*partitionManager).cleanQueues { if children := queue.GetCopyOfChildren(); len(children) != 0 { }
> scheduler.(*partitionManager).cleanRoot { manager.cleanQueues(manager.pc.root) }
> {noformat}
> {{RemoveQueue()}} already has the read lock.
[jira] [Resolved] (YUNIKORN-2550) Fix locking in PartitionContext
[ https://issues.apache.org/jira/browse/YUNIKORN-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2550.
------------------------------------
    Fix Version/s: 1.6.0
                   1.5.1
   Target Version: 1.6.0, 1.5.1
       Resolution: Fixed

> Fix locking in PartitionContext
> -------------------------------
>
>                 Key: YUNIKORN-2550
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2550
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: core - common
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.6.0, 1.5.1
>
> Possible deadlock was detected:
> {noformat}
> placement.(*AppPlacementManager).initialise { m.Lock() } <
> placement.(*AppPlacementManager).initialise { } }
> placement.(*AppPlacementManager).UpdateRules { log.Log(log.Config).Info("Building new rule list for placement manager") }
> scheduler.(*PartitionContext).updatePartitionDetails { err := pc.placementManager.UpdateRules(conf.PlacementRules) }
> scheduler.(*ClusterContext).updateSchedulerConfig { err = part.updatePartitionDetails(p) }
> scheduler.(*ClusterContext).processRMConfigUpdateEvent { err = cc.updateSchedulerConfig(conf, rmID) }
> scheduler.(*Scheduler).handleRMEvent { case *rmevent.RMConfigUpdateEvent: }
> scheduler.(*PartitionContext).GetQueue { pc.RLock() } <
> scheduler.(*PartitionContext).GetQueue { func (pc *PartitionContext) GetQueue(name string) *objects.Queue { }
> placement.(*providedRule).placeApplication { // if we cannot create the queue must exist }
> placement.(*AppPlacementManager).PlaceApplication { queueName, err = checkRule.placeApplication(app, m.queueFn) }
> scheduler.(*PartitionContext).AddApplication { err := pc.getPlacementManager().PlaceApplication(app) }
> scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) }
> scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
> {noformat}
> Lock order differs between {{PartitionContext}} and {{AppPlacementManager}}.
> There is also an interference between {{PartitionContext}} and an {{Application}} object:
> {noformat}
> objects.(*Application).SetTerminatedCallback { sa.Lock() } <
> objects.(*Application).SetTerminatedCallback { func (sa *Application) SetTerminatedCallback(callback func(appID string)) { }
> scheduler.(*PartitionContext).AddApplication { app.SetTerminatedCallback(pc.moveTerminatedApp) }
> scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) }
> scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
> scheduler.(*PartitionContext).GetNode { pc.RLock() } <
> scheduler.(*PartitionContext).GetNode { func (pc *PartitionContext) GetNode(nodeID string) *objects.Node { }
> objects.(*Application).tryPlaceholderAllocate { // resource usage should not change anyway between placeholder and real one at this point }
> objects.(*Queue).TryPlaceholderAllocate { for _, app := range sq.sortApplications(true) { }
> objects.(*Queue).TryPlaceholderAllocate { for _, child := range sq.sortQueues() { }
> scheduler.(*PartitionContext).tryPlaceholderAllocate { alloc := pc.root.TryPlaceholderAllocate(pc.GetNodeIterator, pc.GetNode) }
> scheduler.(*ClusterContext).schedule { // nothing reserved that can be allocated try normal allocate }
> scheduler.(*Scheduler).MultiStepSchedule { // Note, this sleep only works in tests. }
> tests.TestDupReleasesInGangScheduling { // and it waits for the shim's confirmation }
> {noformat}
> There is no need for locked access to {{PartitionContext.nodes}}. The base implementation of {{NodeCollection}} ({{baseNodeCollection}}) is already internally synchronized, and the "nodes" field is set once. Therefore, no locking is necessary when accessing it.