[ https://issues.apache.org/jira/browse/MESOS-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944592#comment-16944592 ]
Meng Zhu commented on MESOS-10006:
----------------------------------
Cross-posting from Slack:
Thanks for the ticket! Unfortunately, the log does not contain much useful
information. Alas, we did not print the slaveId upon check failure. I sent
out a patch to print more information upon check failure:
https://reviews.apache.org/r/71581
We should consider backporting it.
Also, a hunch at a diagnosis: CHECK failures like this one on sorter function
input arguments are almost always bugs on the caller side. In this case, the
most likely cause is a race or inconsistency between the master and the
allocator during recovery.
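A minimal sketch of that failure mode, assuming a hypothetical `ToySorter` class (it is not the actual Mesos `Sorter` interface, and its method names are illustrative only). The sorter only knows about agents the caller has previously reported, so if the caller's view and the sorter's view diverge, a call naming an agent the sorter does not track violates its precondition. Mesos enforces preconditions like this with glog `CHECK`s, which abort the master; the sketch throws instead, and includes the client and agent IDs in the message so the offending input is visible.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical toy sorter: NOT the Mesos Sorter API. It tracks, per client,
// how much of a single scalar resource is allocated on each agent (slave).
class ToySorter {
 public:
  // Caller reports that `amount` was allocated to `client` on `slaveId`.
  void allocated(const std::string& client,
                 const std::string& slaveId,
                 double amount) {
    allocations_[client][slaveId] += amount;
  }

  // Precondition: the caller previously reported an allocation on `slaveId`.
  // A caller whose state has diverged from the sorter's breaks this. Mesos
  // uses an aborting CHECK here; this sketch throws and names the IDs.
  void unallocated(const std::string& client,
                   const std::string& slaveId,
                   double amount) {
    auto clientIt = allocations_.find(client);
    if (clientIt == allocations_.end()) {
      throw std::logic_error("unknown client " + client);
    }
    std::map<std::string, double>& resources = clientIt->second;
    if (resources.count(slaveId) == 0) {
      throw std::logic_error("Check failed: resources.contains(slaveId)"
                             " for client " + client + " on agent " + slaveId);
    }
    resources[slaveId] -= amount;
    // Drop the agent entry once nothing remains allocated on it.
    if (resources[slaveId] <= 0.0) {
      resources.erase(slaveId);
    }
  }

  // True if `client` currently holds resources on `slaveId`.
  bool has(const std::string& client, const std::string& slaveId) const {
    auto it = allocations_.find(client);
    return it != allocations_.end() && it->second.count(slaveId) > 0;
  }

 private:
  std::map<std::string, std::map<std::string, double>> allocations_;
};
```

For example, after `allocated("framework-1", "agent-A", 2.0)`, a call to `unallocated("framework-1", "agent-B", 1.0)` throws with a message naming `agent-B`. That is the kind of detail the original log line lacked.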
> Crash in Sorter: "Check failed: resources.contains(slaveId)"
> ------------------------------------------------------------
>
> Key: MESOS-10006
> URL: https://issues.apache.org/jira/browse/MESOS-10006
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.1.0, 1.4.1, 1.9.0
> Environment: Ubuntu Bionic 18.04, Mesos 1.1.0, 1.4.1, 1.9.0 (logs are
> from 1.9.0).
> Reporter: Terra Field
> Priority: Major
> Attachments: mesos-master.log.gz
>
>
> We've hit a similar exception on 3 different versions of the Mesos master
> (the line number/file name changes, but the failed check is the same), usually
> under very high load:
> {noformat}
> F1003 22:06:54.463502 8579 sorter.hpp:339] Check failed: resources.contains(slaveId)
> {noformat}
> This particular occurrence happened after the election of a new master that
> was then stuck doing framework update broadcasts, as documented in
> MESOS-10005.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)