> On June 8, 2016, 1:28 p.m., Neil Conway wrote:
> > Overall seems like a reasonable approach.
> > 
> > One thing that isn't clear to me: what is the advantage of updating the 
> > checkpoint to reflect any partial work that was done before exiting? That 
> > seems to add a bunch of complexity and room for error. Why not only 
> > update the checkpoint if all changes were made successfully?
> 
> Anindya Sinha wrote:
>     We would need to track which operations actually succeeded in any case, 
> since a failed rmdir during a DESTROY does not cause the agent to exit. So, 
> even if we only updated the checkpoint in one place, we would still need to 
> account for the successful operations so that we do not checkpoint, for 
> example, a DESTROY whose rmdir failed (hence the update can be partial).
>     
>     Since we are keeping track of the results of the operations anyway, I 
> think it is a good idea to update the checkpoint before exiting (the only 
> place we do that is when a CREATE fails and the agent exits), so that the 
> subsequent handling of CheckpointResources does not need to redo those 
> operations when the agent reregisters.
> 
> Neil Conway wrote:
>     On reflection, I wonder whether we should be handling `CREATE` errors 
> differently from `DESTROY` errors. In both cases, the user has asked the 
> agent to do something it wasn't able to do. A failed `DESTROY` has the 
> additional concern that we might have destroyed some but not all of the data 
> on the volume.
>     
>     Do you think handling `CREATE` vs. `DESTROY` errors differently is a good 
> idea?
> 
> Anindya Sinha wrote:
>     Good point. Here is what I think are the use cases:
>     Say we have checkpointed resources (CR) as {V1,V2} where V1, V2 are 2 
> persistent volumes. So, CR(master) = {V1,V2}, and CR(agent) = {V1,V2}.
>     If we now receive a DESTROY(V2): CR(master) = {V1} [since the master's 
> view is not dependent on what happened on the agent]. Suppose that fails on 
> the agent, so CR(agent) = {V1,V2} [since we do not update the checkpointed 
> resources on the agent when a DESTROY fails, which leaves the master and 
> agent inconsistent at this point].
>     
>     Case 1 (current implementation): The agent does not restart on a DESTROY 
> failure, so CR(agent) = {V1,V2}. When the next CheckpointResources is 
> received, i.e. on a subsequent CREATE/DESTROY/RESERVE/UNRESERVE on a 
> different resource, DESTROY(V2) will be reattempted, and if it succeeds the 
> agent and master will be back in sync. However, if the next 
> CheckpointResources is due to a CREATE(V2) [which can happen since V2 is 
> offered as available by the master], that would be a no-op on the agent, 
> since the agent does not treat it as a new resource based on its checkpoint 
> (at this point CR(master) = {V1,V2} and CR(agent) = {V1,V2}), which would be 
> a problem.
>     
>     Case 2 (if we exit the agent on failure): The agent restarts, which 
> triggers a CheckpointResources from master->agent on ReregisterSlave. Since 
> the current view is CR(master) = {V1} and CR(agent) = {V1,V2}, that forces a 
> reattempt of DESTROY(V2) and brings the checkpointed resources back in sync 
> between master and agent.
>     
>     So, I think it might be a better option to exit the agent on a DESTROY 
> failure as well. However, I think we should still update the checkpoint to 
> reflect the operations that did succeed (other CREATEs/DESTROYs) before 
> exiting, so that those operations are not repeated in a subsequent 
> CheckpointResources message. Does that sound reasonable to you?
>     
>     Note: I think this use case is a good example of why we should consider 
> StatusUpdates (or something similar) for operations on reserved resources, 
> viz. CREATE/DESTROY/RESERVE/UNRESERVE, so that the master and agent stay in 
> sync and frameworks get a guaranteed view of offers for reserved resources.
> 
> Neil Conway wrote:
>     Thanks for the writeup! Makes sense.
>     
>     If we cause the agent to exit on a DESTROY failure and only re-apply 
> DESTROYs at `ReregisterSlave`, it seems to me there is still a window in 
> which another `CREATE` can be applied at the master. That would mean we 
> wouldn't reapply the `DESTROY`, which would be bad.
>     
>     My concern about updating the checkpointed resources to reflect *partial* 
> results is that (a) it seems unnecessary, (b) it means the checkpointed 
> resources don't reflect either the *original* or the *desired* state of the 
> agent, which seems problematic.
>     
>     What about the following: when an agent gets a 
> `CheckpointedResourcesMessage` that differs from its current state, it first 
> writes the desired resource state to a new checkpoint file, 
> `checkpoint.target`. Then it tries to apply the delta between 
> `checkpoint.target` and the current checkpoint. If the on-disk state at the 
> agent is updated successfully to match `checkpoint.target`, we rename 
> `checkpoint.target` -> `checkpoint` and we're done. Otherwise, if any I/O 
> operation fails, we exit the agent. Then on restart, we check for the 
> existence of `checkpoint.target` *before* we try to reregister with the 
> master; if `checkpoint.target` != `checkpoint`, we resume trying to apply the 
> delta between `checkpoint.target` and `checkpoint`; if that fails we exit 
> again, and if it succeeds we rename `checkpoint.target` -> `checkpoint` and 
> then continue to reregister with the master.
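>     
>     For concreteness, a rough C++-style sketch of that flow (the helper and 
> path names here are illustrative, not an actual implementation):
>     
>         // 1. Persist the *desired* state before touching volumes on disk.
>         checkpoint(targetPath, targetResources);  // writes checkpoint.target
>     
>         // 2. Try to make the on-disk state match the target.
>         Try<Nothing> applied =
>             applyDelta(committedResources, targetResources);
>     
>         if (applied.isError()) {
>           // Leave checkpoint.target in place so the delta is retried on
>           // restart, before the agent reregisters with the master.
>           EXIT(EXIT_FAILURE) << "Failed to apply delta: " << applied.error();
>         }
>     
>         // 3. Commit: atomically promote the target checkpoint.
>         rename(targetPath, checkpointPath);  // checkpoint.target -> checkpoint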
>     
>     Let me know what you think.
> 
> Anindya Sinha wrote:
>     > If we cause the agent to exit on a DESTROY failure and only re-apply 
> DESTROYs at ReregisterSlave, it seems to me there is still a window in which 
> another CREATE can be applied at the master. That would mean we wouldn't 
> reapply the DESTROY, which would be bad.
>     
>     Agreed, but the window of failure is very small: the agent needs to fail 
> during processing of a `CheckpointResourcesMessage` pertaining to a DESTROY, 
> an offer goes out to a framework with the disk space as available, and a 
> CREATE for the same volume with the same persistence id is received before 
> the agent successfully reregisters.
>     
>     > What about the following: when an agent gets a 
> CheckpointedResourcesMessage that differs from its current state, it first 
> writes the desired resource state to a new checkpoint file, 
> checkpoint.target. Then it tries to apply the delta between checkpoint.target 
> and the current checkpoint. If the on-disk state at the agent is updated 
> successfully to match checkpoint.target, we rename checkpoint.target -> 
> checkpoint and we're done. Otherwise, if any I/O operation fails, we exit the 
> agent. Then on restart, we check for the existence of checkpoint.target 
> before we try to reregister with the master; if checkpoint.target != 
> checkpoint, we resume trying to apply the delta between checkpoint.target and 
> checkpoint; if that fails we exit again, and if it succeeds we rename 
> checkpoint.target -> checkpoint and then continue to reregister with the 
> master.
>     
>     This would work as far as retrying the DESTROY on the agent is concerned; 
> however, the master would still continue to offer frameworks resources that 
> are not actually available. If the DESTROY keeps failing even after multiple 
> retries, the master has already offered that resource as available to the 
> framework, which seems incorrect. I think we need a longer-term discussion on 
> something similar to StatusUpdates for reserved resources 
> (agent->master->framework), so that such a resource is offered to the 
> framework only after the agent has successfully processed it.
>     
>     I will post an update to this RR to address the remaining comments, and 
> modify it as appropriate as we decide based on the current thread.
> 
> Neil Conway wrote:
>     > I think we agree that the window of failure is very small.
>     
>     The window is indeed small, but it is still possible. If we can eliminate 
> the window of failure entirely without too much complexity, I think that is 
> worth investigating.
>     
>     > If DESTROY fails (continuously even after multiple retries), the master 
> has already offered that resource as available to the framework which seems 
> incorrect.
>     
>     Indeed -- although the master can offer resources that are unavailable in 
> general (e.g., the master offers a resource at an agent and the agent crashes 
> concurrently; task launches will return TASK_LOST). I completely agree that 
> we want something akin to status updates + reconciliation for reliably making 
> reservations and dynamic volumes -- +1 to the longer-term discussion.
> 
> Anindya Sinha wrote:
>     Sounds good.
> 
> Jiang Yan Xu wrote:
>     > If we cause the agent to exit on a DESTROY failure and only re-apply 
> DESTROYs at ReregisterSlave, it seems to me there is still a window in which 
> another CREATE can be applied at the master. That would mean we wouldn't 
> reapply the DESTROY, which would be bad.
>     
>     As I was thinking about this: suppose the failure to destroy the volume 
> is due to the agent never receiving the initial CheckpointedResourcesMessage 
> because it is temporarily partitioned (while the master still thinks it is 
> connected), another CREATE of the same volume is applied at the master, and 
> then the agent reregisters. The agent is only going to receive the 2nd 
> CheckpointedResourcesMessage, which would look identical to its local 
> checkpointed resources, and thus no action is taken to clean up the 
> volume... this case is not going to be solved by the proposal here.

That is correct. That scenario will leak the contents of the volume to another 
framework.
ISTM that cannot be avoided as long as `CheckpointResourcesMessage` does not 
have a notion of guaranteed delivery, i.e. the master (specifically the 
allocator) is updated only when it knows for certain that the changes have 
been applied successfully on the agent.

That situation exists in the current world as well. I think this proposal 
handles inconsistencies around failure conditions on the agent, but does not 
quite address all cases of non-delivery of `CheckpointResourcesMessage`.


- Anindya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review136638
-----------------------------------------------------------


On June 20, 2016, 11:41 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 20, 2016, 11:41 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the agent receives CheckpointedResourcesMessage, we store the
> target checkpoint on disk. On successful create and destroy of
> persistent volumes as part of handling this message, we commit
> the checkpoint on disk and clear the target checkpoint.
> 
> However, in case of any failure we do not commit the checkpoint to
> disk, and exit the agent. When the agent restarts and there is a
> target checkpoint present on disk which differs from the committed
> checkpoint, we retry syncing the target and committed checkpoints.
> On success, we reregister the agent with the master, but if the sync
> fails, we do not commit the checkpoint and the agent exits.
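> 
> A rough sketch of the recovery path on restart, where `target` holds the
> target checkpoint read from disk (the helper and variable names here are
> illustrative, not necessarily the exact code in the diff):
> 
>     // On agent recovery, before reregistering with the master:
>     if (target.isSome() && target.get() != checkpointedResources) {
>       Try<Nothing> synced =
>           syncCheckpointedResources(checkpointedResources, target.get());
> 
>       if (synced.isError()) {
>         EXIT(EXIT_FAILURE) << "Failed to sync checkpointed resources: "
>                            << synced.error();
>       }
> 
>       // Commit the target checkpoint and clear it.
>       CHECK_SOME(os::rename(targetPath, checkpointPath));
>     }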
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79 
>   src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5 
>   src/slave/slave.hpp 58ff2bfac6918d989ab36b67cf6ba2f3657c8356 
>   src/slave/slave.cpp 4bf01f2b020f5e975fb57cffcd19865d7431eac2 
>   src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb 
>   src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8 
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>
