> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > In terms of testing, if we don't crash the agent within
> > `syncCheckpointedResources()` but rather return a failure when it fails
> > during recovery, we can capture this in `Slave::__recover` and verify the
> > failed future, right?
> 
> Jiang Yan Xu wrote:
>     Echoing Neil on /r/48315/, can we add a test?
Added in https://reviews.apache.org/r/48315/.


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2520-2521
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2520>
> >
> > If this fails, we don't know how it has failed, and if the agent crashes
> > next and restarts, it's possible for the agent to recover some non-empty
> > resources from the target and erroneously start deleting directories.
> > Possible?
> 
> Anindya Sinha wrote:
>     Is it possible? `os::rm()` for files calls `unlink()`, which clears the
>     inode entry for that file (assuming no other links to that file),
>     although the data may still exist on disk (until another write operation
>     overwrites the actual bits). And isn't clearing the inode entry an atomic
>     operation? So I do not think there is a case where the data is partial.
> 
> Jiang Yan Xu wrote:
>     I was referring to the use of
>     `state::checkpoint(paths::getResourcesTargetPath(metaDir), Resources())`
>     to clean up target checkpointed resources. If we use `os::rm()` it
>     should be no problem.

OK. Using `os::rm()`.
>     Again, all the RRs 48313 through to 48315 go together.
> 
> Jiang Yan Xu wrote:
>     The thing is that the 3 places don't return the same failure, right?
>     They are errors from three different actions. Combining with /r/48315/,
>     the three errors are (forgive my line length):
> 
>     ```
>     return Error("Failed to create the persistent volume " +
>       volume.disk().persistence().id() + " at '" + path + "': " + mkdir.error());
> 
>     return Error("Failed to determine the disk status of persistent volume " +
>       volume.disk().persistence().id() + " at '" + path + "': " + empty.error());
> 
>     return Error("New persistent volume " + volume.disk().persistence().id() +
>       " at '" + path + "' already exists on disk and isn't empty");
>     ```
> 
>     The only thing they share is `"persistent volume " +
>     volume.disk().persistence().id() + " at '" + path + "'"`, so I would use
>     a variable for that:
> 
>     `string volumeDescription = "persistent volume " +
>     volume.disk().persistence().id() + " at '" + path + "'";`
> 
>     As a reader I think this is clearer than tracing
>     `additionalReason.isSome()` or `isFailed` in multiple branches to
>     determine where they lead.
> 
>     Speaking of the relationship between /r/48313/ and /r/48315/, multiple
>     places in this review point to the later review and you have to use a
>     TODO. Would it be cleaner to pull out /r/48314/ as the first review in
>     the chain and combine the other two? Up to you, but if it's more effort
>     to separate the reviews cleanly then leaving them as one is OK too (and
>     it makes them easier to review).

TBH, I do not see how using `additionalReason` to return `Error()` from a
single place is confusing at all. The multiple branches set the actual error,
which is then returned from a single place; I think that is a very common
pattern. Obviously, your recommendation of returning `Error()` directly from
the 3 conditions is fine as well. To avoid unnecessary back and forth on this,
I updated the code.
Regarding the sequence of RRs, let us keep the 3 RRs like we have been doing.


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2583-2591
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2583>
> >
> > This comment is supposed to be revised together with /r/48315/?
> 
> Anindya Sinha wrote:
>     Since RRs 48313 through to 48315 go together, do you want to change this
>     comment in this RR?
> 
> Jiang Yan Xu wrote:
>     I should say that the comment about "directory should be empty" seems to
>     belong not only to review /r/48315/ but also right above the code here.
>     How about just removing the TODO?

OK. Removed this comment from the `DESTROY` section. The `CREATE` section has
a comment as to why the directory may exist even if it is a new `CREATE`
operation.


- Anindya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review138314
-----------------------------------------------------------


On June 20, 2016, 11:41 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 20, 2016, 11:41 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the agent receives CheckpointedResourcesMessage, we store the
> target checkpoint on disk. On successful create and destroy of
> persistent volumes as a part of handling this message, we commit
> the checkpoint on disk and clear the target checkpoint.
> 
> However, in case of any failure we do not commit the checkpoint to
> disk, and exit the agent.
> When the agent restarts and there is a
> target checkpoint present on disk which differs from the committed
> checkpoint, we retry syncing the target and committed checkpoints.
> On success, we reregister the agent with the master; but if it
> fails, we do not commit the checkpoint and the agent exits.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79
>   src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5
>   src/slave/slave.hpp 58ff2bfac6918d989ab36b67cf6ba2f3657c8356
>   src/slave/slave.cpp 4bf01f2b020f5e975fb57cffcd19865d7431eac2
>   src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb
>   src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>