> On Oct. 15, 2014, 4:03 a.m., Adam B wrote:
> > src/slave/slave.cpp, lines 938-948
> > <https://reviews.apache.org/r/26699/diff/1/?file=720970#file720970line938>
> >
> >     Couldn't the Slave and the SUM get out of sync here? Right now, the SUM 
> > will flush its pending status updates as soon as a new master is detected.
> >     I'm imagining a scenario where the SUM is flushing status updates and 
> > the slave handles a status ACK interleaved with a slave re-registration 
> > delivering stale or out-of-sync task states.
> >     Wouldn't it just be better if the SUM didn't flush until after the 
> > slave has successfully re-registered?
> 
> Vinod Kone wrote:
>     Definitely thought about this race. 
>     
>     Yes, it would be better if SUM did the flush after re-registration but I 
> think it is still a race because re-registration could happen due to ZK blips 
> where updates and acks are in flight.
>     
>     I added a comment on why it is safe. Let me know if you still have 
> concerns.

Actually, after thinking more about this and discussing with BenM, there is 
fundamentally still a race between slave sending re-registered message and SUM 
sending its update. To fix this, I will make couple changes.
1) Instead of SUM directly sending updates to the master, it will send to the 
slave which will forward it to the master.
2) Slave updates the latest state of the task when it gets an update from the 
executor and unacknowledged state when it gets an update from the SUM.
3) The latest state in StatusUpdate will be set by the slave (instead of SUM) 
before forwarding it to the master.

These changes should guarantee that re-registration and update messages are 
always in sync w.r.t to the latest and unacknowledged state of a task. As a 
side benefit of 1), updates won't be sent to the master when slave is in the 
middle of re-registration.

Thoughts?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review56604
-----------------------------------------------------------


On Oct. 15, 2014, 9:29 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 9:29 p.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged 
> state to the master.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp 
> a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>

Reply via email to