-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63797/#review190972
-----------------------------------------------------------

src/slave/slave.cpp
Lines 3568-3574 (original), 3573-3578 (patched)
<https://reviews.apache.org/r/63797/#comment268564>

Hmm, what if the checkpoint of the `TargetPath` succeeded but the commit failed? Should we delete the `TargetPath` so that it will not be retried after agent failover? What if the removal also fails?

This is indeed quite tricky. The reason the Apple folks did the prepare+commit thing is to make sure the master and agent stay in sync in case of a failed old operation. That's exactly the problem we're trying to solve here. That makes me wonder whether we still need this prepare+commit style checkpointing for the `ApplyOfferOperationMessage` path (this is guaranteed to be a new master).

Also, we might need to checkpoint offer operations along with total resources atomically for agent default resources, which means we have to use a different checkpoint file for that.

Based on that, my suggestion is that we don't touch the original `checkpointResources` method. Instead, let's introduce a new one that doesn't do this prepare+commit style checkpointing.

Also, is this strictly required for our MVP? If not, I'd suggest we deal with that later.


- Jie Yu


On Nov. 14, 2017, 2:11 p.m., Jan Schlicht wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63797/
> -----------------------------------------------------------
>
> (Updated Nov. 14, 2017, 2:11 p.m.)
>
>
> Review request for mesos, Benjamin Bannier and Jie Yu.
>
>
> Bugs: MESOS-8211
>     https://issues.apache.org/jira/browse/MESOS-8211
>
>
> Repository: mesos
>
>
> Description
> -------
>
> With offer operation handling an agent can send feedback to the master
> when checkpointing fails.
> Old masters will still send a 'CheckpointResourcesMessage', a wrapper
> has been added that fails over the agent when checkpointing fails.
> As before this will result in an agent re-registration and
> reconciliation of resources.
>
>
> Diffs
> -----
>
>   src/slave/slave.hpp c0acaa639a2bacaa6955ae6c5ab41e75dc1d11f7
>   src/slave/slave.cpp d8bacebc74790e955490a158c37ac0d9e75fd6b5
>
>
> Diff: https://reviews.apache.org/r/63797/diff/1/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Jan Schlicht
>
>
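
To make the failure case raised in the review comment concrete, here is a minimal sketch of prepare+commit style checkpointing. It is not the Mesos implementation: `prepare`, `commit`, `checkpoint`, `hasPendingTarget`, and `CheckpointResult` are hypothetical names, and the sketch assumes the commit is a plain file rename with no work in between the two steps. It only illustrates the open question of what happens when the target was written but the commit failed, and the cleanup itself may fail.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical outcome type, introduced only for this sketch.
enum class CheckpointResult { OK, PREPARE_FAILED, COMMIT_FAILED };

// Prepare: write the intended new state to the target path.
bool prepare(const fs::path& target, const std::string& data)
{
  std::ofstream out(target, std::ios::binary | std::ios::trunc);
  if (!out) {
    return false;
  }
  out << data;
  out.flush();
  return out.good();
}

// Commit: promote the target to the committed path via rename.
bool commit(const fs::path& target, const fs::path& committed)
{
  std::error_code ec;
  fs::rename(target, committed, ec);
  return !ec;
}

// After a failover, a leftover target means a commit is still pending
// and would be retried.
bool hasPendingTarget(const fs::path& target)
{
  std::error_code ec;
  return fs::exists(target, ec);
}

CheckpointResult checkpoint(
    const fs::path& target,
    const fs::path& committed,
    const std::string& data)
{
  if (!prepare(target, data)) {
    return CheckpointResult::PREPARE_FAILED;
  }

  if (!commit(target, committed)) {
    // The case from the review: the target exists but was not committed.
    // Removing it avoids a retry after failover, but the removal can fail
    // too; this sketch ignores that error, which is exactly the gap the
    // comment points out.
    std::error_code ec;
    fs::remove(target, ec);
    return CheckpointResult::COMMIT_FAILED;
  }

  return CheckpointResult::OK;
}
```

A caller on the new-master path could report `COMMIT_FAILED` back as operation feedback, while the old-master path would fail over the agent, as the review request describes.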