[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neil Conway updated MESOS-3870:
-------------------------------
    Comment: was deleted

(was: You mean "volatile"? The variable is read and written inside a "synchronized" block, which will do the necessary synchronization (memory barriers) to ensure that other CPUs see the appropriate values (provided they also use synchronized blocks when examining the variable).

There are a few places that read "ProcessBase.state" without holding the mutex (e.g., ProcessManager::resume()) -- that is probably unsafe and should be fixed.

(Note that "volatile" is not sufficient/appropriate for ensuring reasonable semantics for concurrent access to shared state without mutual exclusion, anyway...))

> Prevent out-of-order libprocess message delivery
> ------------------------------------------------
>
>                 Key: MESOS-3870
>                 URL: https://issues.apache.org/jira/browse/MESOS-3870
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Neil Conway
>            Priority: Minor
>              Labels: mesosphere
>
> I was under the impression that {{send()}} provided in-order, unreliable
> message delivery. So if P1 sends <M1,M2> to P2, P2 might see <>, <M1>, <M2>,
> or <M1,M2>, but not <M2,M1>.
> I suspect much of the code makes a similar assumption. However, it appears
> that this behavior is not guaranteed. slave.cpp:2217 has the following
> comment:
> {noformat}
> // TODO(jieyu): Here we assume that CheckpointResourcesMessages are
> // ordered (i.e., slave receives them in the same order master sends
> // them). This should be true in most of the cases because TCP
> // enforces in order delivery per connection. However, the ordering
> // is technically not guaranteed because master creates multiple
> // connections to the slave in some cases (e.g., persistent socket
> // to slave breaks and master uses ephemeral socket). This could
> // potentially be solved by using a version number and rejecting
> // stale messages according to the version number.
> {noformat}
> We can improve this situation by _either_: (1) fixing libprocess to guarantee
> ordered message delivery, e.g., by adding a sequence number, or (2)
> clarifying that ordered message delivery is not guaranteed, and ideally
> providing a tool to force messages to be delivered out-of-order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
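Option (1) is a sequence number, and the slave.cpp TODO proposes something similar: a version number used to reject stale messages. Either idea can be sketched as a receiver-side filter. This is a minimal C++ sketch under stated assumptions, not libprocess code:

* `Message`, `StaleMessageFilter` and `accept()` are hypothetical names.
* The sender is assumed to stamp every message with a sequence number that starts at 1 and strictly increases.

Dropping any message that arrives after a later one from the same sender gives exactly the semantics described in the issue. P2 can observe <>, <M1>, <M2> or <M1,M2>, but never <M2,M1>.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical message envelope (not a libprocess type). Assumes the sender
// stamps each message with a sequence number that starts at 1 and strictly
// increases for every message it sends.
struct Message {
  std::string from;        // sender identity, e.g. a PID rendered as a string
  std::uint64_t sequence;  // assigned by the sender, starting at 1
  std::string payload;
};

// Hypothetical receiver-side filter giving in-order, unreliable delivery:
// a message is delivered only if its sequence number is greater than that of
// the last message delivered from the same sender. A message overtaken by a
// later one is dropped, so the receiver may see gaps but never a reordering.
class StaleMessageFilter {
 public:
  // Returns true if the message should be delivered, false if it is stale
  // (older than, or a duplicate of, a message already delivered).
  bool accept(const Message& message) {
    assert(message.sequence > 0 && "sequence numbers start at 1");
    // operator[] value-initializes the entry to 0 for a sender not seen yet.
    std::uint64_t& last = lastDelivered_[message.from];
    if (message.sequence <= last) {
      return false;
    }
    last = message.sequence;
    return true;
  }

 private:
  std::map<std::string, std::uint64_t> lastDelivered_;
};
```

A filter like this turns reordering into loss, so it only fits an unreliable-delivery contract. Guaranteeing that M1 is still delivered would need buffering or retransmission instead. It also assumes the receiver's per-sender state lives as long as the sender's counter. A sender that restarts and resets its counter would need something like an epoch or incarnation ID alongside the sequence number.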
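The deleted comment argues that every access to shared state, reads included, must hold the same lock. It also argues that `volatile` does not substitute for that lock. Standard C++ primitives can illustrate the point. This is a minimal sketch under stated assumptions:

* `GuardedState` and its `State` enum are hypothetical. They are not libprocess's `ProcessBase` or its `synchronized` macro.
* Every read takes the same mutex as every write.
* A check followed by an update happens under one lock acquisition.

In C++, marking the field `volatile` provides neither guarantee. An unsynchronized read concurrent with a write is still a data race.

```cpp
#include <mutex>

// Hypothetical lifecycle states; not the actual libprocess ProcessBase enum.
enum class State { Idle, Ready, Running, Terminating };

// Hypothetical wrapper: every access to state_, reads included, happens
// while holding mutex_, so all threads observe a consistent value.
class GuardedState {
 public:
  void set(State next) {
    std::lock_guard<std::mutex> lock(mutex_);
    state_ = next;
  }

  // Reads take the same mutex as writes. An unlocked read concurrent with a
  // write would be a data race (undefined behavior in C++), volatile or not.
  State get() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return state_;
  }

  // Check-then-update under a single lock acquisition, so no other thread
  // can change the state between the check and the write.
  bool transition(State from, State to) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (state_ != from) {
      return false;
    }
    state_ = to;
    return true;
  }

 private:
  mutable std::mutex mutex_;  // mutable so const readers can lock it
  State state_ = State::Idle;
};
```

Some frequently called paths may need to read the state without taking the lock. For those, `std::atomic<State>` is the standard tool. It makes individual loads and stores race-free. It does not make a separate read followed by a write atomic, which is why `transition()` holds the lock across both.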