> On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 3146-3149
> > <https://reviews.apache.org/r/25525/diff/8/?file=717605#file717605line3146>
> >
> >     Hmm. i don't think we want these semantics. Why not just fail if 
> > readmission failed?
> 
> Cody Maloney wrote:
>     So if the registrar actually fails to readmit the slave, then the master 
> will go down on a LOG(FATAL). In the case where we couldn't readmit the slave 
> for some reason (Ex: the registrar thought the upgrade was incompatible, hit 
> the fall-through case in master/master.hpp), then the slave will be sent a 
> shutdown message and explicitly removed at that point.
>     
>     So I think the semantics you want are there / the comment can just be 
> removed.

i think "we just don't see..." threw me off. instead say that "readmit will 
fail leading to slave shutdown.".

killing the comment is fine too.


> On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote:
> > src/common/slaveinfo_utils.cpp, lines 64-96
> > <https://reviews.apache.org/r/25525/diff/8/?file=717601#file717601line64>
> >
> >     Hmm. This is still hard to follow. 
> >     
> >     Using variadic templated function here doesn't make it any simpler to 
> > read/follow. Infact it makes it harder because now I need to read 
> > understand what the template function does. I think explicitly checking 
> > each field (as suggested in the earlier comment) reads better.
> 
> Cody Maloney wrote:
>     So I can do that, which gives:
>     ```c++
>       if (!(newSlaveInfo.id() == oldSlaveInfo.id())) {
>         return Error("id cannot be changed (old: " + 
> stringify(oldSlaveInfo.id()) +
>                      " new: " + stringify(newSlaveInfo.id()) + ")");
>       }
>     
>       if (newSlaveInfo.hostname() != oldSlaveInfo.hostname()) {
>         return Error("hostname cannot be changed (old: " +
>                      stringify(oldSlaveInfo.hostname()) + " new: " +
>                      stringify(newSlaveInfo.hostname()) + ")");
>       }
>     
>       if (newSlaveInfo.checkpoint() != oldSlaveInfo.checkpoint()) {
>         return Error("checkpoint cannot be changed (old: " +
>                      stringify(oldSlaveInfo.checkpoint()) + " new: " +
>                      stringify(newSlaveInfo.checkpoint()) + ")");
>       }
>     
>       if (newSlaveInfo.port() != oldSlaveInfo.port()) {
>         return Error("port cannot be changed (old: " +
>                      stringify(oldSlaveInfo.port()) + " new: " +
>                      stringify(newSlaveInfo.port()) + ")");
>       }
>     ```
>     
>     My problems with it are:
>     1. There is a lot of redundant code which needs to be mentally parsed. I 
> have to realize each of those is the same reading 3 lines of code before I 
> can recognize "oh, we're checking over and over again these two fields in the 
> different versions are the same, and mentally condense the code away"
>     2. How likely is it that you would actually notice all the new vs. old if 
> I got one of the error messages wrong? What about if I made any of a large 
> number of possible slight typos in one vs. the next? What if I compare two 
> different string fields? The compiler can't catch any of these or help me 
> with them.
>     
>     
>     ```c++
>     checkMembersUnchanged(newSlaveInfo, oldSlaveInfo,
>         make_pair("SlaveID", &SlaveInfo::id),
>         make_pair("hostname", &SlaveInfo::hostname),
>         make_pair("checkpoint", &SlaveInfo::checkpoint),
>         make_pair("port", &SlaveInfo::port));
>     ```
>     
>     Where when reading just that code, I see I'm giong to check that the 
> members are unchanged, probably between newSlaveInfo and oldSlaveInfo. That 
> the fields are "SlaveID", "hostname", "checkpoint", and "port". 
>     
>     
>     The checkMembersUnchanged takes a little bit more time/effort to 
> recognize, esp. since it is c++11 variadic templates, recursion, and pointers 
> to member functions + calling pointers to member functions (Which is some of 
> the oddest syntax in C++). What is nice about it though is:
>     
>     1. The function though guarantees I call the same function on both old 
> and new / access the same member
>     2. Formats the error string consistently, gives me one place to update / 
> change it if needed
>     3. Makes one place for the ordering to be correct.

If you look around the code base, we prefer approach 1 to 2. It's really not 
that hard to follow that you are doing the same thing. Just adding a comment on 
top would also help guide the users.

For readability, I would also suggest putting "(old: " on the next line and 
"new: " on the next line.

With 2) my problem is that it is too esoteric, both the name and the arguments.


> On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote:
> > src/master/master.hpp, line 1308
> > <https://reviews.apache.org/r/25525/diff/8/?file=717604#file717604line1308>
> >
> >     This should be CHECK_SOME(compatible).
> 
> Cody Maloney wrote:
>     True, a incompatible change shouldn't get that far because there are 
> advance checks, but practically the registrar will do the right thing / 
> recover in the case of a compatibility error here. CHECK_SOME() means that we 
> will always fatally exit and kill the master, which seems worse to me.

fair enough. you can drop this issue.


> On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 3100-3102
> > <https://reviews.apache.org/r/25525/diff/8/?file=717605#file717605line3100>
> >
> >     ```
> >     LOG(WARNING) << "Slave " << *slave << " attempted to re-register with 
> > incompatible info: " 
> >                  << compatible.error() << "; shutting it down"
> >     ```
> >     
> >     Also, maybe we don't need to explicitly 'removeSlave(slave)' here? If a 
> > compatible slave is not re-registered within timeout, master will 
> > automatically remove it.
> 
> Cody Maloney wrote:
>     Updated the message. I would rather explicitly tell the slave to shutdown 
> than wait, otherwise the slave will continue trying to reregister with the 
> master, the master will keep saying no, until the timeout at which point we 
> will tell the slave to shutdown when it tries to re-authenticate with the 
> message "Slave attempted to re-register after removal".
>     
>     Waiting gives the operator worse information, and makes it a longer loop 
> for getting the slave back into the cluster, than giving the notification, 
> and it really isn't much (or very complex) code.

good point.


> On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 3169-3171
> > <https://reviews.apache.org/r/25525/diff/8/?file=717605#file717605line3169>
> >
> >     This seems very unintuitve for users/operators. Lets fix this in this 
> > patch rather than doing a TODO.
> >     
> >     
> >     Also, AFAICT, the changes are not lost because both the registrar and 
> > the allocator are informed about the new slave info. Am I missing something?
> 
> Cody Maloney wrote:
>     Note this isn't a fall through from the above code. All the above code 
> exits. At this point, we haven't talked to the registrar or allocator. All 
> that has happened since reciept of the message is authentication, checking 
> that the slave hasn't already been removed, and checking that the master 
> doesn't already know about the slave (getSlave reutrns null).
>     
>     So this can be triggered by a slave reregistering, and that first 
> registration being sent to the Registrar.
>     
>     While that registration is happening, the same slave sends another 
> reregistration message, this time with updated SlaveInfo. The re-registration 
> check will be hit, and the updated SlaveInfo will be discarded. per the check.
>     
>     The solutions I can see:
>     1) Send a message back to the slave that we already got a reregistration 
> with the last registration's SlaveInfo, and if it doesn't want that one it 
> needs to update it's registration after that one has completed
>     2) Make it so we can queue multiple in-flight reregistrations for a 
> single slave (No idea if this will break anything / would work if you just 
> removed the check). I worry a little with this option that there could be a 
> lot of reregistrations in flight at the same time to the master and it could 
> get overwhelmed doing extra work where before there was an early exit.

I see. So this happens when a slave is re-registering with a failed over master 
but while the readmission is in progress it re-registers again with an updated 
slave info. That comment needs to be expanded to make that clear.

More importantly, not fixing this is a problem because the slave and master 
would be out of sync w.r.t resources and attributes. I think the 
reregisterSlave() needs to be rethought to make this cleaner. Let me think 
about this a bit. Feel free to come up with suggestions too.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25525/#review56590
-----------------------------------------------------------


On Oct. 15, 2014, 7:30 p.m., Cody Maloney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25525/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 7:30 p.m.)
> 
> 
> Review request for mesos, Adam B and Vinod Kone.
> 
> 
> Bugs: MESOS-1739
>     https://issues.apache.org/jira/browse/MESOS-1739
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Allows attributes and resources to be set to a superset of what they were 
> previously on a slave restart.
> 
> Incorporates all comments from: 
> https://issues.apache.org/jira/browse/MESOS-1739
> and the former review request:
> https://reviews.apache.org/r/25111/
> 
> 
> Diffs
> -----
> 
>   src/Makefile.am d503c8df73cda15a9d59254e8265e4a5d0e003a4 
>   src/common/attributes.hpp 0a043d5b5dca804c6dd215cabd2704f24df71a33 
>   src/common/attributes.cpp aab114e1a5932e3f218b850e1afc7f2ef0f10e21 
>   src/common/slaveinfo_utils.hpp PRE-CREATION 
>   src/common/slaveinfo_utils.cpp PRE-CREATION 
>   src/master/allocator.hpp 02d20d0cc802805bc702891306aa42894531b223 
>   src/master/hierarchical_allocator_process.hpp 
> 31dfb2cc86e2e74da43f05fbada10976ce65f3e4 
>   src/master/master.hpp 14f1d0fd240c9cd0718d0238e1fbb9c733190205 
>   src/master/master.cpp 0a5c9a374062a241c90ea238725fbb8dd2408ef4 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/attributes_tests.cpp 240a8cac18ac12178cf73e8eeb88bd50e3fcc03b 
>   src/tests/mesos.hpp 957e2233cc11c438fd80d3b6d1907a1223093104 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/25525/diff/
> 
> 
> Testing
> -------
> 
> make check on localhost
> 
> 
> Thanks,
> 
> Cody Maloney
> 
>

Reply via email to