[ https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhitao Li updated MESOS-7566:
-----------------------------
    Description: 
A check in [sorter.cpp#L355 in 1.1.2 | 
https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
 is triggered occasionally in our cluster and crashes the master leader.

I manually modified that check to print out the related variables, and the 
following is a master log.

https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt

From the log, it appears the check used a stale revocable CPU value of 
{{26}} while the new value had been updated to {{25}}, so the check failed.

So far, the two verified occurrences of this bug were both observed near an 
{{UNRESERVE}} operation (see the preceding lines in the log).

  was:
A check in 
[https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355
 DRFSorter] is triggered occasionally in our cluster and crashes the master 
leader.

I manually modified that check to print out the related variables, and the 
following is a master log.

https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt

From the log, it appears the check used a stale value of 
{{cpus(*){REV}:26}} while the new value had been updated to 
{{cpus(*){REV}:25}}, so it crashed.

So far, the two verified occurrences of this bug were both observed near an 
{{UNRESERVE}} operation (see the preceding lines in the log).


> Master crash due to failed check in DRFSorter::remove
> -----------------------------------------------------
>
>                 Key: MESOS-7566
>                 URL: https://issues.apache.org/jira/browse/MESOS-7566
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.1.1, 1.1.2
>            Reporter: Zhitao Li
>            Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it appears the check used a stale revocable CPU value of 
> {{26}} while the new value had been updated to {{25}}, so the check failed.
> So far, the two verified occurrences of this bug were both observed near an 
> {{UNRESERVE}} operation (see the preceding lines in the log).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
