Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop
On June 15, 2015, 12:07 p.m., Niklas Nielsen wrote: Hey Robert; BenH helped out and wrote a PoC patch here https://reviews.apache.org/r/35405 In short; it is not safe to delete the detector at this point. The patch above does it in join() and has a good descriptive block of comment explaining the issue. What we need to do is try it out and test it. We are working on another approach, suggested by BenH. Do you want to close this RR and continue in the other one? - Niklas --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/#review87958 --- On April 14, 2015, 10:23 p.m., Robert Lacroix wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/ --- (Updated April 14, 2015, 10:23 p.m.) Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and Vinod Kone. Bugs: MESOS-1634 https://issues.apache.org/jira/browse/MESOS-1634 Repository: mesos Description --- When the Mesos Scheduler Driver stops, it does not delete the detector process before the object is garbage collected, which leaves ZooKeeper connections hanging around unnecessarily. This deletes the process on stop as well, not only on destruction. Diffs - src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 Diff: https://reviews.apache.org/r/33208/diff/ Testing --- make check, manual testing Thanks, Robert Lacroix
Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/#review87958 --- Hey Robert; BenH helped out and wrote a PoC patch here https://reviews.apache.org/r/35405 In short; it is not safe to delete the detector at this point. The patch above does it in join() and has a good descriptive block of comment explaining the issue. What we need to do is try it out and test it. - Niklas Nielsen On April 14, 2015, 10:23 p.m., Robert Lacroix wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/ --- (Updated April 14, 2015, 10:23 p.m.) Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and Vinod Kone. Bugs: MESOS-1634 https://issues.apache.org/jira/browse/MESOS-1634 Repository: mesos Description --- When the Mesos Scheduler Driver stops, it does not delete the detector process before the object is garbage collected, which leaves ZooKeeper connections hanging around unnecessarily. This deletes the process on stop as well, not only on destruction. Diffs - src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 Diff: https://reviews.apache.org/r/33208/diff/ Testing --- make check, manual testing Thanks, Robert Lacroix
Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop
On April 20, 2015, 5:55 a.m., Adam B wrote: LGTM, barring a question about ordering/synchronization. I'll let another committer take a look before we commit it. Adam B wrote: Would also like to see a successful ReviewBot pass. That MasterFailover segfault seems like it could be related to your detector changes. Niklas Nielsen wrote: Hey Robert, do you need some help on this patch? Niklas Nielsen wrote: Ping :) Let's get some cadence behind this, or drop it (and reopen when we can work on this again). Niklas Nielsen wrote: Alright - I think I understand the problem. In the test cases, the owned pointer will get double deleted. Let's iterate on how to address this. It's unsafe to delete the detector before we stop SchedulerProcess since we're using the detector in 'SchedulerProcess::detected' and it's possible that the detector will attempt to be used after we dispatch to 'SchedulerProcess::stop' but before we call 'terminate' in 'SchedulerProcess::stop'! We can't delete the detector until after we've waited for the process to terminate. Here's one suggestion, delete the detector in 'join', which is already waiting for 'SchedulerProcess::stop' to complete, so it's not a big deal to also wait for the entire process to terminate (assuming that 'stop' has actually been called). Here's an example of how this might be able to work: https://reviews.apache.org/r/35405 Note that I haven't actually compiled this code or tested it, it's just a proof of concept to hopefully unblock folks here, please let me know if I can help further. The general expectatoin with this could would be that from Java you'd do something like: driver.stop(); driver.join(); // After this returns we know that the detector has been deleted. And in the abort paths it would look like: driver.abort(); driver.stop(); driver.join(); // After this returns we know that the detector has been deleted. - Benjamin --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/#review80655 --- On April 15, 2015, 5:23 a.m., Robert Lacroix wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/ --- (Updated April 15, 2015, 5:23 a.m.) Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and Vinod Kone. Bugs: MESOS-1634 https://issues.apache.org/jira/browse/MESOS-1634 Repository: mesos Description --- When the Mesos Scheduler Driver stops, it does not delete the detector process before the object is garbage collected, which leaves ZooKeeper connections hanging around unnecessarily. This deletes the process on stop as well, not only on destruction. Diffs - src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 Diff: https://reviews.apache.org/r/33208/diff/ Testing --- make check, manual testing Thanks, Robert Lacroix
Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop
On April 19, 2015, 10:55 p.m., Adam B wrote: LGTM, barring a question about ordering/synchronization. I'll let another committer take a look before we commit it. Adam B wrote: Would also like to see a successful ReviewBot pass. That MasterFailover segfault seems like it could be related to your detector changes. Niklas Nielsen wrote: Hey Robert, do you need some help on this patch? Niklas Nielsen wrote: Ping :) Let's get some cadence behind this, or drop it (and reopen when we can work on this again). Alright - I think I understand the problem. In the test cases, the owned pointer will get double deleted. Let's iterate on how to address this. - Niklas --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/#review80655 --- On April 14, 2015, 10:23 p.m., Robert Lacroix wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/ --- (Updated April 14, 2015, 10:23 p.m.) Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and Vinod Kone. Bugs: MESOS-1634 https://issues.apache.org/jira/browse/MESOS-1634 Repository: mesos Description --- When the Mesos Scheduler Driver stops, it does not delete the detector process before the object is garbage collected, which leaves ZooKeeper connections hanging around unnecessarily. This deletes the process on stop as well, not only on destruction. Diffs - src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 Diff: https://reviews.apache.org/r/33208/diff/ Testing --- make check, manual testing Thanks, Robert Lacroix
Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop
On April 19, 2015, 10:55 p.m., Adam B wrote: LGTM, barring a question about ordering/synchronization. I'll let another committer take a look before we commit it. Adam B wrote: Would also like to see a successful ReviewBot pass. That MasterFailover segfault seems like it could be related to your detector changes. Niklas Nielsen wrote: Hey Robert, do you need some help on this patch? Ping :) Let's get some cadence behind this, or drop it (and reopen when we can work on this again). - Niklas --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/#review80655 --- On April 14, 2015, 10:23 p.m., Robert Lacroix wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33208/ --- (Updated April 14, 2015, 10:23 p.m.) Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and Vinod Kone. Bugs: MESOS-1634 https://issues.apache.org/jira/browse/MESOS-1634 Repository: mesos Description --- When the Mesos Scheduler Driver stops, it does not delete the detector process before the object is garbage collected, which leaves ZooKeeper connections hanging around unnecessarily. This deletes the process on stop as well, not only on destruction. Diffs - src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 Diff: https://reviews.apache.org/r/33208/diff/ Testing --- make check, manual testing Thanks, Robert Lacroix