> On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/master/master.cpp, lines 3146-3149 > > <https://reviews.apache.org/r/25525/diff/8/?file=717605#file717605line3146> > > > > Hmm. i don't think we want these semantics. Why not just fail if > > readmission failed? > > Cody Maloney wrote: > So if the registrar actually fails to readmit the slave, then the master > will go down on a LOG(FATAL). In the case where we couldn't readmit the slave > for some reason (Ex: the registrar thought the upgrade was incompatible, hit > the fall-through case in master/master.hpp), then the slave will be sent a > shutdown message and explicitly removed at that point. > > So I think the semantics you want are there / the comment can just be > removed.
i think "we just don't see..." threw me off. instead say that "readmit will fail leading to slave shutdown.". killing the comment is fine too. > On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/common/slaveinfo_utils.cpp, lines 64-96 > > <https://reviews.apache.org/r/25525/diff/8/?file=717601#file717601line64> > > > > Hmm. This is still hard to follow. > > > > Using variadic templated function here doesn't make it any simpler to > > read/follow. Infact it makes it harder because now I need to read > > understand what the template function does. I think explicitly checking > > each field (as suggested in the earlier comment) reads better. > > Cody Maloney wrote: > So I can do that, which gives: > ```c++ > if (!(newSlaveInfo.id() == oldSlaveInfo.id())) { > return Error("id cannot be changed (old: " + > stringify(oldSlaveInfo.id()) + > " new: " + stringify(newSlaveInfo.id()) + ")"); > } > > if (newSlaveInfo.hostname() != oldSlaveInfo.hostname()) { > return Error("hostname cannot be changed (old: " + > stringify(oldSlaveInfo.hostname()) + " new: " + > stringify(newSlaveInfo.hostname()) + ")"); > } > > if (newSlaveInfo.checkpoint() != oldSlaveInfo.checkpoint()) { > return Error("checkpoint cannot be changed (old: " + > stringify(oldSlaveInfo.checkpoint()) + " new: " + > stringify(newSlaveInfo.checkpoint()) + ")"); > } > > if (newSlaveInfo.port() != oldSlaveInfo.port()) { > return Error("port cannot be changed (old: " + > stringify(oldSlaveInfo.port()) + " new: " + > stringify(newSlaveInfo.port()) + ")"); > } > ``` > > My problems with it are: > 1. There is a lot of redundant code which needs to be mentally parsed. I > have to realize each of those is the same reading 3 lines of code before I > can recognize "oh, we're checking over and over again these two fields in the > different versions are the same, and mentally condense the code away" > 2. How likely is it that you would actually notice all the new vs. old if > I got one of the error messages wrong? What about if I made any of a large > number of possible slight typos in one vs. the next? What if I compare two > different string fields? The compiler can't catch any of these or help me > with them. > > > ```c++ > checkMembersUnchanged(newSlaveInfo, oldSlaveInfo, > make_pair("SlaveID", &SlaveInfo::id), > make_pair("hostname", &SlaveInfo::hostname), > make_pair("checkpoint", &SlaveInfo::checkpoint), > make_pair("port", &SlaveInfo::port)); > ``` > > Where when reading just that code, I see I'm giong to check that the > members are unchanged, probably between newSlaveInfo and oldSlaveInfo. That > the fields are "SlaveID", "hostname", "checkpoint", and "port". > > > The checkMembersUnchanged takes a little bit more time/effort to > recognize, esp. since it is c++11 variadic templates, recursion, and pointers > to member functions + calling pointers to member functions (Which is some of > the oddest syntax in C++). What is nice about it though is: > > 1. The function though guarantees I call the same function on both old > and new / access the same member > 2. Formats the error string consistently, gives me one place to update / > change it if needed > 3. Makes one place for the ordering to be correct. If you look around the code base, we prefer approach 1 to 2. It's really not that hard to follow that you are doing the same thing. Just adding a comment on top would also help guide the users. For readability, I would also suggest putting "(old: " on the next line and "new: " on the next line. With 2) my problem is that it is too esoteric, both the name and the arguments. > On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/master/master.hpp, line 1308 > > <https://reviews.apache.org/r/25525/diff/8/?file=717604#file717604line1308> > > > > This should be CHECK_SOME(compatible). > > Cody Maloney wrote: > True, a incompatible change shouldn't get that far because there are > advance checks, but practically the registrar will do the right thing / > recover in the case of a compatibility error here. CHECK_SOME() means that we > will always fatally exit and kill the master, which seems worse to me. fair enough. you can drop this issue. > On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/master/master.cpp, lines 3100-3102 > > <https://reviews.apache.org/r/25525/diff/8/?file=717605#file717605line3100> > > > > ``` > > LOG(WARNING) << "Slave " << *slave << " attempted to re-register with > > incompatible info: " > > << compatible.error() << "; shutting it down" > > ``` > > > > Also, maybe we don't need to explicitly 'removeSlave(slave)' here? If a > > compatible slave is not re-registered within timeout, master will > > automatically remove it. > > Cody Maloney wrote: > Updated the message. I would rather explicitly tell the slave to shutdown > than wait, otherwise the slave will continue trying to reregister with the > master, the master will keep saying no, until the timeout at which point we > will tell the slave to shutdown when it tries to re-authenticate with the > message "Slave attempted to re-register after removal". > > Waiting gives the operator worse information, and makes it a longer loop > for getting the slave back into the cluster, than giving the notification, > and it really isn't much (or very complex) code. good point. > On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/master/master.cpp, lines 3169-3171 > > <https://reviews.apache.org/r/25525/diff/8/?file=717605#file717605line3169> > > > > This seems very unintuitve for users/operators. Lets fix this in this > > patch rather than doing a TODO. > > > > > > Also, AFAICT, the changes are not lost because both the registrar and > > the allocator are informed about the new slave info. Am I missing something? > > Cody Maloney wrote: > Note this isn't a fall through from the above code. All the above code > exits. At this point, we haven't talked to the registrar or allocator. All > that has happened since reciept of the message is authentication, checking > that the slave hasn't already been removed, and checking that the master > doesn't already know about the slave (getSlave reutrns null). > > So this can be triggered by a slave reregistering, and that first > registration being sent to the Registrar. > > While that registration is happening, the same slave sends another > reregistration message, this time with updated SlaveInfo. The re-registration > check will be hit, and the updated SlaveInfo will be discarded. per the check. > > The solutions I can see: > 1) Send a message back to the slave that we already got a reregistration > with the last registration's SlaveInfo, and if it doesn't want that one it > needs to update it's registration after that one has completed > 2) Make it so we can queue multiple in-flight reregistrations for a > single slave (No idea if this will break anything / would work if you just > removed the check). I worry a little with this option that there could be a > lot of reregistrations in flight at the same time to the master and it could > get overwhelmed doing extra work where before there was an early exit. I see. So this happens when a slave is re-registering with a failed over master but while the readmission is in progress it re-registers again with an updated slave info. That comment needs to be expanded to make that clear. More importantly, not fixing this is a problem because the slave and master would be out of sync w.r.t resources and attributes. I think the reregisterSlave() needs to be rethought to make this cleaner. Let me think about this a bit. Feel free to come up with suggestions too. - Vinod ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25525/#review56590 ----------------------------------------------------------- On Oct. 15, 2014, 7:30 p.m., Cody Maloney wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/25525/ > ----------------------------------------------------------- > > (Updated Oct. 15, 2014, 7:30 p.m.) > > > Review request for mesos, Adam B and Vinod Kone. > > > Bugs: MESOS-1739 > https://issues.apache.org/jira/browse/MESOS-1739 > > > Repository: mesos-git > > > Description > ------- > > Allows attributes and resources to be set to a superset of what they were > previously on a slave restart. > > Incorporates all comments from: > https://issues.apache.org/jira/browse/MESOS-1739 > and the former review request: > https://reviews.apache.org/r/25111/ > > > Diffs > ----- > > src/Makefile.am d503c8df73cda15a9d59254e8265e4a5d0e003a4 > src/common/attributes.hpp 0a043d5b5dca804c6dd215cabd2704f24df71a33 > src/common/attributes.cpp aab114e1a5932e3f218b850e1afc7f2ef0f10e21 > src/common/slaveinfo_utils.hpp PRE-CREATION > src/common/slaveinfo_utils.cpp PRE-CREATION > src/master/allocator.hpp 02d20d0cc802805bc702891306aa42894531b223 > src/master/hierarchical_allocator_process.hpp > 31dfb2cc86e2e74da43f05fbada10976ce65f3e4 > src/master/master.hpp 14f1d0fd240c9cd0718d0238e1fbb9c733190205 > src/master/master.cpp 0a5c9a374062a241c90ea238725fbb8dd2408ef4 > src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 > src/tests/attributes_tests.cpp 240a8cac18ac12178cf73e8eeb88bd50e3fcc03b > src/tests/mesos.hpp 957e2233cc11c438fd80d3b6d1907a1223093104 > src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f > > Diff: https://reviews.apache.org/r/25525/diff/ > > > Testing > ------- > > make check on localhost > > > Thanks, > > Cody Maloney > >