> On March 20, 2014, 9:21 p.m., Vinod Kone wrote: > > src/master/master.hpp, line 400 > > <https://reviews.apache.org/r/19383/diff/1/?file=526626#file526626line400> > > > > Comment?
> On March 20, 2014, 9:21 p.m., Vinod Kone wrote: > > src/master/master.hpp, line 404 > > <https://reviews.apache.org/r/19383/diff/1/?file=526626#file526626line404> > > > > Why is Registrar in caps? > > > > s/Registrar/registrar/ to be consistent. Here and everywhere else. Okay, but we're only consistent here with "Master" and "Slave", not so much with other things. For example, we refer to "Framework" below. > On March 20, 2014, 9:21 p.m., Vinod Kone wrote: > > src/master/master.hpp, line 421 > > <https://reviews.apache.org/r/19383/diff/1/?file=526626#file526626line421> > > > > Why cache instead of a circular buffer? What's the issue here? Circular buffer does not support lookup. In the past we've used circular buffer because we've wanted two properties: 1. Set semantics. 2. LRI semantics. In this case, we want LRU semantics and cache<X, Nothing> provides the desired set semantics. > On March 20, 2014, 9:21 p.m., Vinod Kone wrote: > > src/master/master.cpp, lines 609-610 > > <https://reviews.apache.org/r/19383/diff/1/?file=526627#file526627line609> > > > > But they could still fire if they are already dispatched and are in the > > queue right? Would that cause problems? It's a race of course, but it's likely that the dispatches would be dropped before the new Master started. Also, the timing required for these to fire but be in the queue when we're destructing is pretty precise, do you have any suggestions here? > On March 20, 2014, 9:21 p.m., Vinod Kone wrote: > > src/master/master.cpp, line 1946 > > <https://reviews.apache.org/r/19383/diff/1/?file=526627#file526627line1946> > > > > why the change? what does activated mean? I changed this because "slave does not exist" is pretty generic and does not capture those slaves that are registering, re-registering, being removed, etc. These slaves "exist" despite what the log message says, so I'm trying to be a bit more accurate in our wording. Suggestions? > On March 20, 2014, 9:21 p.m., Vinod Kone wrote: > > src/master/master.cpp, line 2072 > > <https://reviews.apache.org/r/19383/diff/1/?file=526627#file526627line2072> > > > > s/reply/send(from,/ ? Why change this, the method is using 'reply' elsewhere. > On March 20, 2014, 9:21 p.m., Vinod Kone wrote: > > src/master/master.cpp, line 2270 > > <https://reviews.apache.org/r/19383/diff/1/?file=526627#file526627line2270> > > > > Why not use the getSlave() helper? The plan is to kill the getSlave() helpers now that we have hashmap providing nice get() and contains() methods. Also because getSlave() is pretty implicit in the fact that it retrieves "activated" slaves. Now that we have more than a single "slaves" map, we should probably be explicit about which kind of slaves we're dealing with in the code. > On March 20, 2014, 9:21 p.m., Vinod Kone wrote: > > src/master/master.cpp, lines 3267-3270 > > <https://reviews.apache.org/r/19383/diff/1/?file=526627#file526627line3267> > > > > Why do it this way? Why not do the bulk of this method in > > _removeSlave() ? Have you thought through the implications of that approach? If we keep the slave around in-memory when we decided to remove it, we need to treat this slave in a special manner to keep things consistent for frameworks. We're likely to end up with bugs / consistency problems if we do this. Say we decide to remove the slave but we keep it in-memory. If a status update arrives from this slave before the write completes, we better drop the update or else we're potentially going to lie about a TASK_LOST in _removeSlave. The only way the Registrar can deny a removal operation is if the slave is already removed, so removing it from the in-memory state first is safe. We also don't have to think about all the cases where a slave might send a message between removeSlave and _removeSlave. > On March 20, 2014, 9:21 p.m., Vinod Kone wrote: > > src/master/master.cpp, lines 3326-3328 > > <https://reviews.apache.org/r/19383/diff/1/?file=526627#file526627line3326> > > > > This almost feels like it could use a > > 'state' variable in the Slave struct to transition it through various > > states. Have you considered that option? Yes :) But unfortunately we can't introduce it in 'Slave' because we probably don't want to keep 'Slave' objects around so that we can track things like "deactivated" and "removing" slaves. We also wouldn't be able to do look-ups. Ideally we could more explicitly model the state machine for slaves, but that's probably something we should consider for a follow up change! - Ben ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/19383/#review37918 ----------------------------------------------------------- On March 19, 2014, 12:57 a.m., Ben Mahler wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/19383/ > ----------------------------------------------------------- > > (Updated March 19, 2014, 12:57 a.m.) > > > Review request for mesos, Benjamin Hindman and Vinod Kone. > > > Bugs: MESOS-764 > https://issues.apache.org/jira/browse/MESOS-764 > > > Repository: mesos-git > > > Description > ------- > > This implements the Registry-backed Master, with the following exceptions > that will be addressed in follow up changes: > > -Note that the --registry_strict flag is enforced to be false in > master/main.cpp. > -Reconciliation remains unimplemented as before. > -Improvements can be made to killTask, specifically we should add SlaveID to > the message in order to drop fewer requests for unknown slaves. > -Orthogonally, this does not address MESOS-682. > > I've updated 'deactivated' slaves to be a cache of SlaveIDs rather than UPIDs > as this was the intent originally (we were concerned about the unbounded > growth of the set, but cache<SlaveID, Nothing> keeps a fixed capacity). > > > Diffs > ----- > > src/master/constants.hpp cdaaad060d4ee777f8b0838b63c0fd031da861ea > src/master/constants.cpp 18548834468243bef8ae090f70363e2b9f571ac5 > src/master/master.hpp 0c7c5204c31087f12c4e98028f90c1b941eab7c7 > src/master/master.cpp 6da776699beb6f449e8160dcb6a125d94c1ab437 > src/messages/messages.proto c26a3d0e69bbbd447c859cf175c139ab8948fde2 > src/slave/slave.cpp d8d3e0fa54972201d72b2650ec0ba922a4912d54 > > Diff: https://reviews.apache.org/r/19383/diff/ > > > Testing > ------- > > This change preserves the previous semantics and so all existing tests pass. > > This is because the Registrar can only operate in a "non-strict" manner. > > An unfortunate effect of this change is that many tests run slower due to the > fact that messages are dropped while we're recovering, an alternative > approach here would be to re-enqueue *all* incoming messages through > recover(). However, this adds queuing delay to each message processed in the > Master and the performance implications of this are not well understood for > large clusters. > > > Thanks, > > Ben Mahler > >
