[ https://issues.apache.org/jira/browse/YUNIKORN-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135049#comment-17135049 ]
Weiwei Yang edited comment on YUNIKORN-202 at 6/14/20, 6:06 AM:
----------------------------------------------------------------

Looks like a regression after YUNIKORN-169. When fixing the panic issue in YUNIKORN-169, we started modifying the AllocationProposals#UUID field, in cluster_info.go L568:

{code:java}
// all is confirmed set the UUID in the proposal to pass it back to the scheduler
// currently used when an ask is removed while allocation is in flight
proposal.UUID = allocInfo.AllocationProto.UUID
{code}

Right now, this write is not protected by a lock, which causes a data race if there is a concurrent read in

{code:java}
func (m *ClusterInfo) HandleEvent(ev interface{}) {
	switch v := ev.(type) {
	case *cacheevent.AllocationProposalBundleEvent:
		enqueueAndCheckFull(m.pendingSchedulerEvents, v)
	...
	}
}
{code}

But this is something I do not understand. The workflow should be like this:
# enter scheduling cycle
# an allocation proposal is made by the scheduler
# send the allocation proposal to cache to process, cluster_info#processAllocationProposalEvent(), this is where READ happens
# cache handles the proposal in cluster_info#processAllocationProposalEvent(), this is where WRITE happens

Step 4 should happen after step 3; they are not supposed to happen at the same time.

[~wilfreds], any thoughts on this?

> race detected in TestBasicSchedulerAutoAllocation
> -------------------------------------------------
>
>                 Key: YUNIKORN-202
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-202
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: test - unit
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Ting Yao,Huang
>            Priority: Critical
>              Labels: race-condition
>         Attachments: data_race.txt
>
>
> This is an almost identical race to the one we had in YUNIKORN-155:
> --- FAIL: TestBasicSchedulerAutoAllocation (0.04s)
>     testing.go:809: race detected during execution of test
> Attaching data race detection logs.