> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > In terms of testing, if we don't crash the agent within
> > `syncCheckpointedResources()` but rather return a failure when it fails
> > during recovery, we can capture this in `Slave::__recover` and verify the
> > failed future, right?
> 
> Jiang Yan Xu wrote:
>     Echoing Neil on /r/48315/, can we add a test?
Added in https://reviews.apache.org/r/48315/.


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2520-2521
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2520>
> >
> > If this fails, we don't know how it has failed, and if the agent crashes
> > next and restarts, it's possible for the agent to recover some non-empty
> > resources from the target and erroneously start deleting directories.
> > Possible?
> 
> Anindya Sinha wrote:
>     Is it possible? `os::rm()` for files calls `unlink()`, which clears the
>     inode entry for that file (assuming no other links to that file),
>     although the data may still exist on disk (until another write operation
>     overwrites the actual bits). And isn't clearing the inode entry an atomic
>     operation? So I do not think there is a case where the data is partial.
> 
> Jiang Yan Xu wrote:
>     I was referring to the use of
>     `state::checkpoint(paths::getResourcesTargetPath(metaDir), Resources())`
>     to clean up target checkpointed resources. If we use `os::rm()` it
>     should be no problem.

OK. Using `os::rm()`.
>     Again, all the RRs 48313 through to 48315 go together.
> 
> Jiang Yan Xu wrote:
>     The thing is that the 3 places don't return the same failure, right?
>     They are errors from three different actions. Combining with /r/48315/,
>     the three errors are (forgive my line length):
> 
>     ```
>     return Error("Failed to create the persistent volume " +
>       volume.disk().persistence().id() + " at '" + path + "': " + mkdir.error());
> 
>     return Error("Failed to determine the disk status of persistent volume " +
>       volume.disk().persistence().id() + " at '" + path + "': " + empty.error());
> 
>     return Error("New persistent volume " + volume.disk().persistence().id() +
>       " at '" + path + "' already exists on disk and isn't empty");
>     ```
> 
>     The only thing they share is `"persistent volume " +
>     volume.disk().persistence().id() + " at '" + path + "'"`, so I would use
>     a variable for that:
> 
>     `string volumeDescription = "persistent volume " +
>     volume.disk().persistence().id() + " at '" + path + "'";`
> 
>     As a reader I think this is clearer than tracing
>     `additionalReason.isSome()` or `isFailed` in multiple branches to
>     determine where they lead.
> 
>     Speaking of the relationship between /r/48313/ and /r/48315/, multiple
>     places in this review point to the later review and you have to use a
>     TODO. Would it be cleaner to pull out /r/48314/ as the first review in
>     the chain and combine the other two? Up to you, but if it's more effort
>     to separate the reviews cleanly then leaving them as one is OK too (and
>     it makes them easier to review).

TBH, I do not see how using `additionalReason` to return `Error()` from a
single place is confusing at all. The multiple branches set the actual error,
which is then returned from a single place; I think that is a very common
pattern. Obviously, your recommendation of returning `Error()` directly from
the 3 conditions is fine as well. To avoid unnecessary back and forth on this,
I updated the code.
Regarding the sequence of RRs, let us keep the 3 RRs like we have been doing.


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2583-2591
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2583>
> >
> > This comment is supposed to be revised together with /r/48315/?
> 
> Anindya Sinha wrote:
>     Since RRs 48313 through to 48315 go together, do you want to change this
>     comment in this RR?
> 
> Jiang Yan Xu wrote:
>     I should say that the comment about "directory should be empty" seems to
>     belong not only to review /r/48315/ but also right above the code here.
>     How about just removing the TODO?

OK. Removed this comment from the `DESTROY` section. The `CREATE` section has
a comment as to why the directory may exist even if it is a new `CREATE`
operation.


- Anindya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review138314
-----------------------------------------------------------


On June 20, 2016, 11:41 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 20, 2016, 11:41 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the agent receives CheckpointedResourcesMessage, we store the
> target checkpoint on disk. On successful create and destroy of
> persistent volumes as a part of handling this message, we commit
> the checkpoint on disk and clear the target checkpoint.
> 
> However, in case of any failure we do not commit the checkpoint to
> disk, and exit the agent.
> When the agent restarts and there is a
> target checkpoint present on disk which differs from the committed
> checkpoint, we retry syncing the target and committed checkpoints.
> On success, we reregister the agent with the master; but if it
> fails, we do not commit the checkpoint and the agent exits.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79
>   src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5
>   src/slave/slave.hpp 58ff2bfac6918d989ab36b67cf6ba2f3657c8356
>   src/slave/slave.cpp 4bf01f2b020f5e975fb57cffcd19865d7431eac2
>   src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb
>   src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>