[ https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhitao Li updated MESOS-7566:
-----------------------------
    Description:
A check in [sorter.cpp#L355 in 1.1.2 | https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355] is triggered occasionally in our cluster and crashes the master leader.

I manually modified that check to print out the related variables, and the following is the resulting master log.

https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt

From the log, it seems the check was using a stale revocable CPU value of {{26}} after the value had been updated to 25, so the check failed.

So far, the two verified occurrences of this bug were both observed near an {{UNRESERVE}} operation (see the lines above in the log).

  was:
A check in [https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355 DRFSorter] is triggered occasionally in our cluster and crashes the master leader.

I manually modified that check to print out the related variables, and the following is the resulting master log.

https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt

From the log, it seems the check was using a stale value of {{cpus(*){REV}:26}} after the value had been updated to {{cpus(*){REV}:25}}, so it crashed.

So far, the two verified occurrences of this bug were both observed near an {{UNRESERVE}} operation (see the lines above in the log).

> Master crash due to failed check in DRFSorter::remove
> -----------------------------------------------------
>
>                 Key: MESOS-7566
>                 URL: https://issues.apache.org/jira/browse/MESOS-7566
>             Project: Mesos
>          Issue Type: Bug
>   Affects Versions: 1.1.1, 1.1.2
>            Reporter: Zhitao Li
>            Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 |
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
> is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the
> following is the resulting master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems the check was using a stale revocable CPU value of
> {{26}} after the value had been updated to 25, so the check failed.
> So far, the two verified occurrences of this bug were both observed near an
> {{UNRESERVE}} operation (see the lines above in the log).
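
The failure mode described in the report (a removal request carrying a stale quantity of 26 while the tracked quantity is already 25) can be illustrated with a minimal sketch. It assumes the failing check is a containment test on tracked resources; the `Allocation` type and the `contains` and `removeAllocation` helpers are hypothetical names for illustration and are not the Mesos API. The sketch throws rather than aborting so that the failure is observable.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical, simplified stand-in for tracked resources:
// resource name (e.g. "cpus(*){REV}") -> quantity currently tracked.
using Allocation = std::map<std::string, double>;

// Returns true if `tracked` holds at least `quantity` of `name`.
bool contains(const Allocation& tracked,
              const std::string& name,
              double quantity)
{
  auto it = tracked.find(name);
  return it != tracked.end() && it->second >= quantity;
}

// Models a removal guarded by a containment check. A real CHECK would
// abort the process; this sketch throws so callers can observe it.
void removeAllocation(Allocation& tracked,
                      const std::string& name,
                      double quantity)
{
  if (!contains(tracked, name, quantity)) {
    throw std::logic_error(
        "check failed: removing " + std::to_string(quantity) + " of " +
        name + " exceeds the tracked amount");
  }

  tracked[name] -= quantity;
  assert(tracked[name] >= 0.0);
}
```

With a tracked value of 25, calling `removeAllocation(tracked, "cpus(*){REV}", 26.0)` throws, while a request for 25 succeeds. This only models the symptom; it says nothing about why the caller held the stale value.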