[ 
https://issues.apache.org/jira/browse/YUNIKORN-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135049#comment-17135049
 ] 

Weiwei Yang edited comment on YUNIKORN-202 at 6/14/20, 6:06 AM:
----------------------------------------------------------------

Looks like a regression after YUNIKORN-169.

The fix for the panic in YUNIKORN-169 modifies the 
AllocationProposals#UUID field, in cluster_info.go L568:
{code:java}
// all is confirmed set the UUID in the proposal to pass it back to the scheduler
// currently used when an ask is removed while allocation is in flight
proposal.UUID = allocInfo.AllocationProto.UUID
{code}
Right now, this write is not protected by a lock, which causes a data race if 
a concurrent read happens in:
{code:java}
func (m *ClusterInfo) HandleEvent(ev interface{}) {
  switch v := ev.(type) {
  case *cacheevent.AllocationProposalBundleEvent:
    enqueueAndCheckFull(m.pendingSchedulerEvents, v)

...

}
{code}
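To make the locking point concrete, here is a minimal sketch of one way to guard such a field. It is not the YuniKorn code: the type allocationProposal, its mu field, and the SetUUID/GetUUID accessors are hypothetical names used only for illustration, and the real fix may look quite different.

```go
package main

import (
	"fmt"
	"sync"
)

// allocationProposal is a hypothetical, simplified stand-in for the real
// proposal type; only the UUID field is modelled here.
type allocationProposal struct {
	mu   sync.RWMutex
	uuid string
}

// SetUUID writes the UUID while holding the write lock.
func (p *allocationProposal) SetUUID(uuid string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.uuid = uuid
}

// GetUUID reads the UUID while holding the read lock.
func (p *allocationProposal) GetUUID() string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.uuid
}

func main() {
	p := &allocationProposal{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer, standing in for the cache side
		defer wg.Done()
		p.SetUUID("alloc-1")
	}()
	go func() { // concurrent reader; may observe "" or "alloc-1"
		defer wg.Done()
		_ = p.GetUUID()
	}()
	wg.Wait()
	fmt.Println(p.GetUUID()) // the write has completed once wg.Wait returns
}
```

Running a sketch like this with go run -race should report nothing, because every access to the field goes through the lock. Accessing the field directly from either goroutine would bring the report back.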
But there is something I do not understand. The workflow should be:
 # enter the scheduling cycle
 # an allocation proposal is made by the scheduler
 # the allocation proposal is sent to the cache to process, cluster_info#processAllocationProposalEvent(), this is where the READ happens 
 # the cache handles the proposal in cluster_info#processAllocationProposalEvent(), this is where the WRITE happens

Step 4 should happen after step 3; they are not supposed to happen at the same 
time.
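
The ordering expectation can be sketched in isolation. In the Go memory model a send on a channel happens before the corresponding receive completes, so a read done before the send is ordered before a write done after the receive. The example below is a simplified illustration only: the proposal type, the handOff function, and the events and done channels are hypothetical, and it does not claim to show where the actual race in this issue comes from.

```go
package main

import "fmt"

// proposal is a hypothetical, simplified stand-in for an allocation proposal.
type proposal struct {
	UUID string
}

// handOff reads the proposal before sending it over a channel and does not
// touch it again until the consumer has signalled that it is done.
func handOff(p *proposal, assigned string) (before, after string) {
	events := make(chan *proposal)
	done := make(chan struct{})

	go func() {
		q := <-events     // the receive completes after the send has started
		q.UUID = assigned // WRITE on the consumer side
		close(done)       // the close happens before the receive on done returns
	}()

	before = p.UUID // READ, ordered before the send
	events <- p     // hand the pointer to the consumer
	<-done          // wait for the consumer, so the WRITE is visible
	after = p.UUID
	return before, after
}

func main() {
	before, after := handOff(&proposal{}, "alloc-1")
	fmt.Printf("before=%q after=%q\n", before, after)
}
```

If the sender kept reading p.UUID after the send without waiting on done, the same pointer would be accessed from two goroutines with no ordering between them, which is the kind of access the race detector reports. Whether that is what happens here is an open question.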

[~wilfreds], any thoughts on this?


> race detected in TestBasicSchedulerAutoAllocation
> -------------------------------------------------
>
>                 Key: YUNIKORN-202
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-202
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: test - unit
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Ting Yao,Huang
>            Priority: Critical
>              Labels: race-condition
>         Attachments: data_race.txt
>
>
> This is an almost identical race to the one we had in YUNIKORN-155:
> --- FAIL: TestBasicSchedulerAutoAllocation (0.04s)
>     testing.go:809: race detected during execution of test
> Attaching data race detection logs.


