> On Sept. 15, 2015, 10:32 p.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, line 3637
> > <https://reviews.apache.org/r/38003/diff/2/?file=1064545#file1064545line3637>
> >
> > Does this test reliably fail (i.e., every time) without the code change
> > in master.cpp?

No; the repro rate is about 90% (9 in 10 runs). The root cause is that the master host re-uses the port; if the master does not re-use the port, the issue is not triggered. For example, I cannot reproduce it on Ubuntu 14.04 with the default settings, but it is easy to reproduce on OS X.


> On Sept. 15, 2015, 10:32 p.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, line 3636
> > <https://reviews.apache.org/r/38003/diff/2/?file=1064545#file1064545line3636>
> >
> > Also add a CHECK_NE() check with both the slave ids?

Not sure whether it's necessary: if the master sees duplicated slave IDs, it asks the second slave (the one with the same ID) to shut down, and in that case the test fails while waiting for the re-register message.


> On Sept. 15, 2015, 10:32 p.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, lines 3607-3608
> > <https://reviews.apache.org/r/38003/diff/2/?file=1064545#file1064545line3607>
> >
> > Why specify a mock executor and test containerizer? There's a
> > StartSlave() overload that takes just the detector (and optionally flags),
> > which you can use?

Yes, they are only there for the detector; let me try to use StartSlave() with the detector only.


- Klaus


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38003/#review99112
-----------------------------------------------------------


On Sept. 14, 2015, 6:08 p.m., Klaus Ma wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38003/
> -----------------------------------------------------------
>
> (Updated Sept. 14, 2015, 6:08 p.m.)
>
>
> Review request for mesos, Ben Mahler, Jie Yu, and Vinod Kone.
>
>
> Bugs: MESOS-3351
>     https://issues.apache.org/jira/browse/MESOS-3351
>
>
> Repository: mesos
>
>
> Description
> -------
>
> __Phenomenon:__
> Under some race conditions, the slave was shut down after master failover.
>
> __Root Cause:__
> The slave was shut down because of a duplicated SlaveID: in the master, the
> SlaveID is generated as masterInfo.id + "-S" + nextSlaveId; on master failover,
> nextSlaveId is reset to 0 and masterInfo.id (generated from date + ip + port +
> pid) may be unchanged, which leads to a duplicated SlaveID.
>
> __Solution/Fix:__
> Generate masterInfo.id from a UUID instead of "date + ip + port + pid".
>
>
> Diffs
> -----
>
>   src/master/master.cpp 5589eca
>   src/tests/master_tests.cpp 8a6b98b
>
> Diff: https://reviews.apache.org/r/38003/diff/
>
>
> Testing
> -------
>
> make
> make check
>
>
> Thanks,
>
> Klaus Ma
>
>
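
For readers following the root cause in the description above, here is a minimal, self-contained sketch of how the duplicated SlaveID can arise and why a random master ID avoids it. It is not the code from master.cpp: `MasterSketch`, `legacyMasterId()`, `randomMasterId()` and `demonstrateFailover()` are hypothetical names, the exact layout of the legacy ID string is an assumption, and the 128-bit random hex token only stands in for a real UUID.

```cpp
#include <cassert>
#include <cstdint>
#include <iomanip>
#include <random>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for the master's ID bookkeeping; not Mesos code.
struct MasterSketch {
  explicit MasterSketch(std::string masterId) : id(std::move(masterId)) {}

  // SlaveID scheme from the description: masterInfo.id + "-S" + nextSlaveId.
  std::string newSlaveId() { return id + "-S" + std::to_string(nextSlaveId++); }

  std::string id;            // Plays the role of masterInfo.id.
  uint64_t nextSlaveId = 0;  // Starts from 0 again in every new master.
};

// Old scheme: the ID is derived from date, ip, port and pid.
// The separator and field order here are assumptions for illustration.
std::string legacyMasterId(
    const std::string& date, const std::string& ip, int port, int pid) {
  std::ostringstream out;
  out << date << "-" << ip << "-" << port << "-" << pid;
  return out.str();
}

// Stand-in for a UUID: 128 random bits rendered as 32 hex digits.
std::string randomMasterId() {
  std::random_device device;
  std::ostringstream out;
  out << std::hex << std::setfill('0');
  for (int i = 0; i < 4; ++i) {
    out << std::setw(8) << (device() & 0xffffffffu);
  }
  return out.str();
}

// Walks through the failover scenario from the description.
void demonstrateFailover() {
  // Same date, ip, port and pid before and after failover: same master ID,
  // so the first SlaveID handed out by each master is identical.
  MasterSketch before(legacyMasterId("20150914", "10.0.0.1", 5050, 4242));
  MasterSketch after(legacyMasterId("20150914", "10.0.0.1", 5050, 4242));
  assert(before.newSlaveId() == after.newSlaveId());

  // With a random master ID, the counter reset no longer matters.
  MasterSketch first(randomMasterId());
  MasterSketch second(randomMasterId());
  assert(first.newSlaveId() != second.newSlaveId());
}
```

Calling `demonstrateFailover()` from a `main()` exercises both cases. The sketch only models the ID arithmetic; the port re-use and timing that make the real race show up on some hosts are outside its scope.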