Meng Zhu created MESOS-9460:
-------------------------------

             Summary: Speculative operations may make master and agent resource views out of sync.
                 Key: MESOS-9460
                 URL: https://issues.apache.org/jira/browse/MESOS-9460
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 1.7.0, 1.6.1, 1.5.1
            Reporter: Meng Zhu

This bug could happen with the following sequence of events:

- The agent (re)registers with the master.
- Speculative operation calls are made to the master.
- The allocator is speculatively updated in https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11315
- Before the agent's resources get updated, the agent sends `UpdateSlaveMessage` on receiving the (re)registered message, if it has the `RESOURCE_PROVIDER` capability or oversubscription is used (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1551 and https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1633).
- The `UpdateSlaveMessage` triggers the allocator to update the total resources with STALE info sent from the agent (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L8205), so the update from the previous operation is overwritten and LOST.
- The agent finishes the operation and informs the master through `UpdateOperationStatusMessage`, but for a speculative operation we do not update the allocator (https://github.com/apache/mesos/blob/master/src/master/master.cpp#L11177).
- The resource views of the master and the agent are out of sync.

This caused MESOS-7971 and likely MESOS-9458 as well.

[~chhsia0] proposes to use `resource_version_uuid` to fix this (https://issues.apache.org/jira/browse/MESOS-7971?focusedCommentId=16712278&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16712278).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
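
The sequence of events in the description can be replayed with a small standalone model. This is only a sketch and not Mesos code: `ResourceView`, `applyReserve`, `RaceResult` and `simulateRace` are hypothetical names, resources are reduced to two CPU counters, and message passing is replaced by plain assignments made in the order the report describes.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-in for an agent's total resources.
// Mesos itself uses a much richer `Resources` class; only CPUs are modeled.
struct ResourceView {
  int unreservedCpus;
  int reservedCpus;

  bool operator==(const ResourceView& other) const {
    return unreservedCpus == other.unreservedCpus &&
           reservedCpus == other.reservedCpus;
  }
};

// Models a speculative operation such as RESERVE (an assumption made for
// illustration): it moves `cpus` from the unreserved to the reserved pool.
ResourceView applyReserve(ResourceView view, int cpus) {
  view.unreservedCpus -= cpus;
  view.reservedCpus += cpus;
  return view;
}

struct RaceResult {
  ResourceView master;  // Total the allocator believes the agent has.
  ResourceView agent;   // Total the agent actually has.

  bool inSync() const { return master == agent; }
};

RaceResult simulateRace() {
  // The agent (re)registers; the allocator learns its total resources.
  ResourceView agent{4, 0};
  ResourceView master = agent;

  // A speculative operation arrives: the allocator is updated immediately,
  // while the operation itself is still in flight to the agent.
  master = applyReserve(master, 2);

  // The agent has not applied the operation yet, so the total it reports
  // (the `UpdateSlaveMessage` step) is stale and overwrites the
  // allocator's speculative update.
  ResourceView stale = agent;
  master = stale;

  // The agent applies the operation and reports completion (the
  // `UpdateOperationStatusMessage` step). For a speculative operation the
  // allocator is not updated again, so `master` is left unchanged here.
  agent = applyReserve(agent, 2);

  return RaceResult{master, agent};
}
```

Calling `simulateRace()` in this model leaves the master's view with no reserved CPUs while the agent's view has two, so `inSync()` is false, which is the divergence reported above.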