-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review138314
-----------------------------------------------------------



In terms of testing, if we don't crash the agent within 
`syncCheckpointedResources()` but rather return a failure when its fails during 
recovery, we can capture this in `Slave::__recover` and verify the failed 
future right?


src/slave/paths.hpp (lines 87 - 88)
<https://reviews.apache.org/r/48313/#comment203484>

    Could you add the new file to this figure?



src/slave/slave.hpp (lines 426 - 428)
<https://reviews.apache.org/r/48313/#comment203490>

    Preceding underscores are mostly used for separating similarly named 
continuations.



src/slave/slave.cpp (line 2506)
<https://reviews.apache.org/r/48313/#comment203489>

    Nit: the master and the agent?



src/slave/slave.cpp (line 2511)
<https://reviews.apache.org/r/48313/#comment203666>

    "checkpoint resources target"



src/slave/slave.cpp (line 2513)
<https://reviews.apache.org/r/48313/#comment203491>

    The first argument is already a member variable and doesn't need to be 
passed around right?
    
    It's unclear from the method name whether the "sync" implies syncing the 
two arguments or syncing the disk state and `newCheckpointedResources`. Also 
it's a bit unintuitive that the resources info file is checkpointed in the 
method but member variable `checkpointedResources` is assigned outside instead.
    
    Would the following semantics be clearer?
    
    1. Use this method to do one thing: bring the filesystem in sync with the 
newCheckpointedResources and return an Error if anything goes wrong.
    
    ```
    Try<Nothing> syncCheckpointedResources(const Resources& 
newCheckpointedResources);
    ```
    
    2. On the caller side: if the result is an error, EXIT. Otherwise 
checkpoint the resource info file and remove the target file. Note that we need 
to do similar things in `recover()` but in there you would fail the future 
instead of directly EXIT.



src/slave/slave.cpp (lines 2518 - 2523)
<https://reviews.apache.org/r/48313/#comment203505>

    Use `os::rm()` to remove?



src/slave/slave.cpp (lines 2520 - 2521)
<https://reviews.apache.org/r/48313/#comment203815>

    If this fails, we don't know how it has failed and if the agent crashes 
next and restarts, it's possible for the agent to recover some non-empty 
resources from the target and erroneously starts deleting directories. Possible?



src/slave/slave.cpp (lines 2552 - 2553)
<https://reviews.apache.org/r/48313/#comment203818>

    Can we directly return in case of an error instead of using variables like 
this?



src/slave/slave.cpp (line 2564)
<https://reviews.apache.org/r/48313/#comment203817>

    No cleanups actually, just checking if it's empty right?



src/slave/slave.cpp (lines 2583 - 2585)
<https://reviews.apache.org/r/48313/#comment203819>

    This comment is supposed to be revised together with /r/48315/?



src/slave/slave.cpp (line 4736)
<https://reviews.apache.org/r/48313/#comment203543>

    We need to differentiate between the target being empty (everthing is 
unreserved) and there's no target (changes are processed and committed) right?



src/slave/slave.cpp (line 4741)
<https://reviews.apache.org/r/48313/#comment203833>

    Here we don't want recovery to crash the agent here. Everything else 
returns a falure in recovery and the agent exits in `__recover()`. We should do 
the same for reattempting the sync.
    
    Have `syncCheckpointedResources()` return a Try can help with this.



src/slave/slave.cpp (lines 4748 - 4752)
<https://reviews.apache.org/r/48313/#comment203663>

    Same as in `checkpointResources()`.



src/slave/state.hpp (line 278)
<https://reviews.apache.org/r/48313/#comment203534>

    Here we can have 
    
    ```
    Resources resources;
    Option<Resources> target;
    ```
    
    If ResourcesState is some, it always has `resources` (which could be 
empty), it doesn't always have `target`, which only exists before the requested 
target is committed.
    
    Right?



src/slave/state.hpp (lines 305 - 307)
<https://reviews.apache.org/r/48313/#comment203533>

    It's a bit strange to say that the entire state consists of the slave 
state, the resources state and the target resources state.
    
    IMO both the info and the target should belong to `ResourcesState 
resources` (in the filesystem they are both under "resources" directory. It'll 
be great if the in-memory state and the filesystem hierarchy are structured 
similarly.



src/slave/state.cpp (lines 704 - 707)
<https://reviews.apache.org/r/48313/#comment203537>

    Instead of this, we can factor out the common code that reads the 
`Resources` objects.


- Jiang Yan Xu


On June 15, 2016, 4 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 15, 2016, 4 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the agent receives CheckpointedResourcesMessage, we store the
> target checkpoint on disk. On successful create and destroy of
> persistent volumes as a part of handling this messages, we commit
> the checkpoint on the disk, and clear the target checkpoint.
> 
> However, incase of any failure we do not commit the checkpoint to
> disk, and exit the agent. When the agent restarts and there is a
> target checkpoint present on disk which differs from the committed
> checkpoint, we retry to sync the target and committed checkpoint.
> On success, we reregister the agent with the master, but in case it
> fails, we do not commit the checkpoint and the agent exists.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79 
>   src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5 
>   src/slave/slave.hpp 3ed026e4b7e886ae738567369ee5750591ef97d9 
>   src/slave/slave.cpp 0af04d6fe53f92e03905fb7b3bec72b09d5e8e57 
>   src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb 
>   src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8 
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>

Reply via email to