[jira] [Created] (YUNIKORN-2559) DATA RACE: EventStore.Store() and Context.PublishEvents()

2024-04-15 Thread Yu-Lin Chen (Jira)
Yu-Lin Chen created YUNIKORN-2559:
-

 Summary:  DATA RACE: EventStore.Store() and Context.PublishEvents()
 Key: YUNIKORN-2559
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2559
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - common
Reporter: Yu-Lin Chen
Assignee: Yu-Lin Chen
 Attachments: shim-racing-log.txt

How to reproduce:
 # In the shim, update the core dependency to the latest version (v0.0.0-20240415111844-72540e2b277f)
 # Run 'go mod tidy'
 # Run 'make test > shim-racing-log.txt'

 
{code}
WARNING: DATA RACE
Write at 0x00c003882008 by goroutine 59:
  github.com/apache/yunikorn-core/pkg/events.(*EventStore).Store()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_store.go:59 +0x1aa
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartServiceWithPublisher.func2()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:194 +0x167

Previous read at 0x00c003882008 by goroutine 60:
  github.com/apache/yunikorn-k8shim/pkg/cache.(*Context).PublishEvents()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/cache/context.go:1176 +0x97
  github.com/apache/yunikorn-k8shim/pkg/cache.(*AsyncRMCallback).SendEvent()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/cache/scheduler_callback.go:235 +0xec
  github.com/apache/yunikorn-core/pkg/events.(*EventPublisher).StartService.func1()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_publisher.go:60 +0x27d

Goroutine 59 (running) created at:
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartServiceWithPublisher()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:183 +0x287
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartService()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:166 +0x2b
  github.com/apache/yunikorn-core/pkg/entrypoint.startAllServicesWithParameters()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/entrypoint/entrypoint.go:80 +0x9b
  github.com/apache/yunikorn-core/pkg/entrypoint.StartAllServices()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/entrypoint/entrypoint.go:43 +0x59
  github.com/apache/yunikorn-k8shim/pkg/shim.(*MockScheduler).init()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/shim/scheduler_mock_test.go:63 +0xad
  github.com/apache/yunikorn-k8shim/pkg/shim.TestApplicationScheduling()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/shim/scheduler_test.go:60 +0x8c
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44

Goroutine 60 (running) created at:
  github.com/apache/yunikorn-core/pkg/events.(*EventPublisher).StartService()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_publisher.go:50 +0xc4
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartServiceWithPublisher()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:203 +0x2b8
  github.com/apache/yunikorn-core/pkg/events.(*EventSystemImpl).StartService()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/events/event_system.go:166 +0x2b
  github.com/apache/yunikorn-core/pkg/entrypoint.startAllServicesWithParameters()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/entrypoint/entrypoint.go:80 +0x9b
  github.com/apache/yunikorn-core/pkg/entrypoint.StartAllServices()
      /home/chenyulin0719/go/pkg/mod/github.com/apache/yunikorn-core@v0.0.0-20240415111844-72540e2b277f/pkg/entrypoint/entrypoint.go:43 +0x59
  github.com/apache/yunikorn-k8shim/pkg/shim.(*MockScheduler).init()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/shim/scheduler_mock_test.go:63 +0xad
  github.com/apache/yunikorn-k8shim/pkg/shim.TestApplicationScheduling()
      /home/chenyulin0719/yunikorn/yunikorn-k8shim/pkg/shim/scheduler_test.go:60 +0x8c
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44
==================
{code}
*Root cause:*

{{EventStore.events}} is read and written by two different goroutines without synchronization:
 * Goroutine 1: EventPublisher.StartService() call 
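
For illustration, here is a minimal, self-contained Go sketch of this
pattern (hypothetical names, not the actual YuniKorn types): a producer
goroutine appends to a shared events slice while a publisher goroutine
drains it. With the mutex calls removed, running under 'go run -race'
reports a write/read pair on the slice just like the log above; guarding
both paths as shown avoids it.

{code:go}
package main

import (
	"sync"
	"time"
)

// eventStore is a hypothetical stand-in for the core's EventStore:
// a slice shared between a producer and a publisher goroutine.
type eventStore struct {
	sync.Mutex
	events []string
}

// Store appends an event. Without the lock, this write races with
// the read in CollectEvents below.
func (s *eventStore) Store(e string) {
	s.Lock() // removing this lock reproduces the reported race
	defer s.Unlock()
	s.events = append(s.events, e)
}

// CollectEvents swaps the slice out for publishing, mirroring the
// read side of the race (EventPublisher -> Context.PublishEvents).
func (s *eventStore) CollectEvents() []string {
	s.Lock()
	defer s.Unlock()
	collected := s.events
	s.events = nil
	return collected
}

func main() {
	store := &eventStore{}
	done := make(chan struct{})
	go func() { // publisher goroutine, like EventPublisher.StartService
		defer close(done)
		for i := 0; i < 100; i++ {
			_ = store.CollectEvents()
			time.Sleep(time.Millisecond)
		}
	}()
	for i := 0; i < 10000; i++ { // producer side, like EventStore.Store callers
		store.Store("event")
	}
	<-done
}
{code}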

[jira] [Created] (YUNIKORN-2558) Remove redundant conditional

2024-04-15 Thread Hsien-Cheng(Ryan) Huang (Jira)
Hsien-Cheng(Ryan) Huang created YUNIKORN-2558:
-

 Summary: Remove redundant conditional
 Key: YUNIKORN-2558
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2558
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - kubernetes
Reporter: Hsien-Cheng(Ryan) Huang
 Fix For: 1.5.0


The script skips {{make tools}} if the tools folder already exists. That is 
a bit odd, since we never check that all the tools inside the folder are 
actually present. The script should call {{make tools}} unconditionally and 
let {{make tools}} do the checking and installing.






[jira] [Resolved] (YUNIKORN-2552) Recursive locking when sending remove queue event

2024-04-15 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2552.

 Fix Version/s: 1.6.0
                1.5.1
    Resolution: Fixed

> Recursive locking when sending remove queue event
> -
>
> Key: YUNIKORN-2552
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2552
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> When sending a queue event from {{queueEvents}}, we acquire the read lock 
> again. 
> {noformat}
> objects.(*Queue).IsManaged { sq.RLock() } <
> objects.(*Queue).IsManaged { func (sq *Queue) IsManaged() bool { }
> objects.(*queueEvents).sendRemoveQueueEvent { } }
> objects.(*Queue).RemoveQueue { sq.queueEvents.sendRemoveQueueEvent() }
> scheduler.(*partitionManager).cleanQueues { // all OK update the queue hierarchy and partition }
> scheduler.(*partitionManager).cleanQueues { if children := queue.GetCopyOfChildren(); len(children) != 0 { }
> scheduler.(*partitionManager).cleanRoot { manager.cleanQueues(manager.pc.root) }
> {noformat}
> {{RemoveQueue()}} already has the read lock.
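
For context, Go's sync.RWMutex read lock is not reentrant: once a writer is
waiting on Lock(), new readers block, so a goroutine that already holds the
read lock and acquires it again can deadlock. A minimal sketch with
hypothetical names (not the actual Queue code, which takes the second read
lock indirectly via the remove-queue event):

{code:go}
package main

import (
	"sync"
	"time"
)

// queue is a hypothetical stand-in for objects.Queue.
type queue struct {
	sync.RWMutex
	managed bool
}

// isManaged takes the read lock, like Queue.IsManaged in the trace.
func (q *queue) isManaged() bool {
	q.RLock()
	defer q.RUnlock()
	return q.managed
}

// removeQueue already holds the read lock when it calls isManaged,
// acquiring the read lock recursively.
func (q *queue) removeQueue() {
	q.RLock()
	defer q.RUnlock()
	time.Sleep(10 * time.Millisecond) // give the writer time to queue up
	_ = q.isManaged()                 // second RLock blocks behind the pending writer
}

func main() {
	q := &queue{}
	go func() {
		time.Sleep(time.Millisecond)
		q.Lock() // writer arrives between the two read locks
		q.managed = true
		q.Unlock()
	}()
	q.removeQueue() // hangs: reader waits for writer, writer waits for reader
}
{code}

A common fix is a hypothetical unlocked internal variant (an
isManagedInternal() that assumes the caller holds the lock), or sending the
event outside the locked section.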






[jira] [Resolved] (YUNIKORN-2550) Fix locking in PartitionContext

2024-04-15 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2550.

 Fix Version/s: 1.6.0
                1.5.1
Target Version: 1.6.0, 1.5.1
    Resolution: Fixed

> Fix locking in PartitionContext
> ---
>
> Key: YUNIKORN-2550
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2550
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - common
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> Possible deadlock was detected:
> {noformat}
> placement.(*AppPlacementManager).initialise { m.Lock() } <
> placement.(*AppPlacementManager).initialise { } }
> placement.(*AppPlacementManager).UpdateRules { log.Log(log.Config).Info("Building new rule list for placement manager") }
> scheduler.(*PartitionContext).updatePartitionDetails { err := pc.placementManager.UpdateRules(conf.PlacementRules) }
> scheduler.(*ClusterContext).updateSchedulerConfig { err = part.updatePartitionDetails(p) }
> scheduler.(*ClusterContext).processRMConfigUpdateEvent { err = cc.updateSchedulerConfig(conf, rmID) }
> scheduler.(*Scheduler).handleRMEvent { case *rmevent.RMConfigUpdateEvent: }
> scheduler.(*PartitionContext).GetQueue { pc.RLock() } <
> scheduler.(*PartitionContext).GetQueue { func (pc *PartitionContext) GetQueue(name string) *objects.Queue { }
> placement.(*providedRule).placeApplication { // if we cannot create the queue must exist }
> placement.(*AppPlacementManager).PlaceApplication { queueName, err = checkRule.placeApplication(app, m.queueFn) }
> scheduler.(*PartitionContext).AddApplication { err := pc.getPlacementManager().PlaceApplication(app) }
> scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) }
> scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
> {noformat}
> Lock order is different between {{PartitionContext}} and 
> {{AppPlacementManager}}.
> There's also an interference between {{PartitionContext}} and an 
> {{Application}} object:
> {noformat}
> objects.(*Application).SetTerminatedCallback { sa.Lock() } <
> objects.(*Application).SetTerminatedCallback { func (sa *Application) SetTerminatedCallback(callback func(appID string)) { }
> scheduler.(*PartitionContext).AddApplication { app.SetTerminatedCallback(pc.moveTerminatedApp) }
> scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) }
> scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
> scheduler.(*PartitionContext).GetNode { pc.RLock() } <
> scheduler.(*PartitionContext).GetNode { func (pc *PartitionContext) GetNode(nodeID string) *objects.Node { }
> objects.(*Application).tryPlaceholderAllocate { // resource usage should not change anyway between placeholder and real one at this point }
> objects.(*Queue).TryPlaceholderAllocate { for _, app := range sq.sortApplications(true) { }
> objects.(*Queue).TryPlaceholderAllocate { for _, child := range sq.sortQueues() { }
> scheduler.(*PartitionContext).tryPlaceholderAllocate { alloc := pc.root.TryPlaceholderAllocate(pc.GetNodeIterator, pc.GetNode) }
> scheduler.(*ClusterContext).schedule { // nothing reserved that can be allocated try normal allocate }
> scheduler.(*Scheduler).MultiStepSchedule { // Note, this sleep only works in tests. }
> tests.TestDupReleasesInGangScheduling { // and it waits for the shim's confirmation }
> {noformat}
> There's no need for locked access to {{PartitionContext.nodes}}. The base 
> implementation of {{NodeCollection}} ({{baseNodeCollection}}) is already 
> internally synchronized, and the "nodes" field is set only once. Therefore, 
> no locking is necessary when accessing it.
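
For context, the first trace above is the classic ABBA lock-order
inversion. A minimal sketch with hypothetical names (not the actual
YuniKorn types): one goroutine locks the placement manager and then the
partition, while the other locks them in the opposite order, so each can
end up holding the lock the other needs:

{code:go}
package main

import "sync"

// Hypothetical stand-ins for PartitionContext and AppPlacementManager.
type partitionContext struct {
	sync.Mutex
	pm *placementManager
}

type placementManager struct {
	sync.Mutex
	pc *partitionContext
}

// updateRules locks the manager first, then the partition:
// order A (manager) -> B (partition), like the config-update path.
func (m *placementManager) updateRules() {
	m.Lock()
	defer m.Unlock()
	m.pc.Lock() // blocks if addApplication holds the partition lock
	defer m.pc.Unlock()
}

// addApplication locks the partition first, then the manager:
// order B -> A, like the app-submission path. Opposite order: deadlock risk.
func (pc *partitionContext) addApplication() {
	pc.Lock()
	defer pc.Unlock()
	pc.pm.Lock() // blocks if updateRules holds the manager lock
	defer pc.pm.Unlock()
}

func main() {
	pc := &partitionContext{}
	pm := &placementManager{pc: pc}
	pc.pm = pm

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); pm.updateRules() }()    // config-update path
	go func() { defer wg.Done(); pc.addApplication() }() // app-submission path
	wg.Wait() // can hang forever under an unlucky interleaving
}
{code}

Keeping one global lock order, or narrowing the critical sections so one
side no longer takes the second lock (as with the set-once "nodes" field
above), breaks the cycle.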


