Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-08-24 Thread Niklas Nielsen


 On June 15, 2015, 12:07 p.m., Niklas Nielsen wrote:
  Hey Robert; BenH helped out and wrote a PoC patch here 
  https://reviews.apache.org/r/35405
  
  In short; it is not safe to delete the detector at this point. The patch 
  above does it in join() and has a good descriptive block of comment 
  explaining the issue.
  What we need to do is try it out and test it.

We are working on another approach, suggested by BenH. Do you want to close 
this RR and continue in the other one?


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review87958
---


On April 14, 2015, 10:23 p.m., Robert Lacroix wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33208/
 ---
 
 (Updated April 14, 2015, 10:23 p.m.)
 
 
 Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
 Vinod Kone.
 
 
 Bugs: MESOS-1634
 https://issues.apache.org/jira/browse/MESOS-1634
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When the Mesos Scheduler Driver stops, it does not delete the detector 
 process before the object is garbage collected, which leaves ZooKeeper 
 connections hanging around unnecessarily. This deletes the process on stop as 
 well, not only on destruction.
 
 
 Diffs
 -
 
   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
 
 Diff: https://reviews.apache.org/r/33208/diff/
 
 
 Testing
 ---
 
 make check, manual testing
 
 
 Thanks,
 
 Robert Lacroix
 




Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-06-15 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review87958
---


Hey Robert; BenH helped out and wrote a PoC patch here 
https://reviews.apache.org/r/35405

In short; it is not safe to delete the detector at this point. The patch above 
does it in join() and has a good descriptive block of comment explaining the 
issue.
What we need to do is try it out and test it.

- Niklas Nielsen


On April 14, 2015, 10:23 p.m., Robert Lacroix wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33208/
 ---
 
 (Updated April 14, 2015, 10:23 p.m.)
 
 
 Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
 Vinod Kone.
 
 
 Bugs: MESOS-1634
 https://issues.apache.org/jira/browse/MESOS-1634
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When the Mesos Scheduler Driver stops, it does not delete the detector 
 process before the object is garbage collected, which leaves ZooKeeper 
 connections hanging around unnecessarily. This deletes the process on stop as 
 well, not only on destruction.
 
 
 Diffs
 -
 
   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
 
 Diff: https://reviews.apache.org/r/33208/diff/
 
 
 Testing
 ---
 
 make check, manual testing
 
 
 Thanks,
 
 Robert Lacroix
 




Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-06-12 Thread Benjamin Hindman


 On April 20, 2015, 5:55 a.m., Adam B wrote:
  LGTM, barring a question about ordering/synchronization. I'll let another 
  committer take a look before we commit it.
 
 Adam B wrote:
 Would also like to see a successful ReviewBot pass. That MasterFailover 
 segfault seems like it could be related to your detector changes.
 
 Niklas Nielsen wrote:
 Hey Robert, do you need some help on this patch?
 
 Niklas Nielsen wrote:
 Ping :) Let's get some cadence behind this, or drop it (and reopen when 
 we can work on this again).
 
 Niklas Nielsen wrote:
 Alright - I think I understand the problem. In the test cases, the owned 
 pointer will get double deleted.
 
 Let's iterate on how to address this.

It's unsafe to delete the detector before we stop SchedulerProcess since we're 
using the detector in 'SchedulerProcess::detected' and it's possible that the 
detector will attempt to be used after we dispatch to 'SchedulerProcess::stop' 
but before we call 'terminate' in 'SchedulerProcess::stop'!

We can't delete the detector until after we've waited for the process to 
terminate. Here's one suggestion, delete the detector in 'join', which is 
already waiting for 'SchedulerProcess::stop' to complete, so it's not a big 
deal to also wait for the entire process to terminate (assuming that 'stop' has 
actually been called). Here's an example of how this might be able to work: 
https://reviews.apache.org/r/35405

Note that I haven't actually compiled this code or tested it, it's just a proof 
of concept to hopefully unblock folks here, please let me know if I can help 
further. The general expectatoin with this could would be that from Java you'd 
do something like:

driver.stop();
driver.join(); // After this returns we know that the detector has been deleted.

And in the abort paths it would look like:

driver.abort();
driver.stop();
driver.join(); // After this returns we know that the detector has been deleted.


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review80655
---


On April 15, 2015, 5:23 a.m., Robert Lacroix wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33208/
 ---
 
 (Updated April 15, 2015, 5:23 a.m.)
 
 
 Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
 Vinod Kone.
 
 
 Bugs: MESOS-1634
 https://issues.apache.org/jira/browse/MESOS-1634
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When the Mesos Scheduler Driver stops, it does not delete the detector 
 process before the object is garbage collected, which leaves ZooKeeper 
 connections hanging around unnecessarily. This deletes the process on stop as 
 well, not only on destruction.
 
 
 Diffs
 -
 
   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
 
 Diff: https://reviews.apache.org/r/33208/diff/
 
 
 Testing
 ---
 
 make check, manual testing
 
 
 Thanks,
 
 Robert Lacroix
 




Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-06-11 Thread Niklas Nielsen


 On April 19, 2015, 10:55 p.m., Adam B wrote:
  LGTM, barring a question about ordering/synchronization. I'll let another 
  committer take a look before we commit it.
 
 Adam B wrote:
 Would also like to see a successful ReviewBot pass. That MasterFailover 
 segfault seems like it could be related to your detector changes.
 
 Niklas Nielsen wrote:
 Hey Robert, do you need some help on this patch?
 
 Niklas Nielsen wrote:
 Ping :) Let's get some cadence behind this, or drop it (and reopen when 
 we can work on this again).

Alright - I think I understand the problem. In the test cases, the owned 
pointer will get double deleted.

Let's iterate on how to address this.


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review80655
---


On April 14, 2015, 10:23 p.m., Robert Lacroix wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33208/
 ---
 
 (Updated April 14, 2015, 10:23 p.m.)
 
 
 Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
 Vinod Kone.
 
 
 Bugs: MESOS-1634
 https://issues.apache.org/jira/browse/MESOS-1634
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When the Mesos Scheduler Driver stops, it does not delete the detector 
 process before the object is garbage collected, which leaves ZooKeeper 
 connections hanging around unnecessarily. This deletes the process on stop as 
 well, not only on destruction.
 
 
 Diffs
 -
 
   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
 
 Diff: https://reviews.apache.org/r/33208/diff/
 
 
 Testing
 ---
 
 make check, manual testing
 
 
 Thanks,
 
 Robert Lacroix
 




Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-06-01 Thread Niklas Nielsen


 On April 19, 2015, 10:55 p.m., Adam B wrote:
  LGTM, barring a question about ordering/synchronization. I'll let another 
  committer take a look before we commit it.
 
 Adam B wrote:
 Would also like to see a successful ReviewBot pass. That MasterFailover 
 segfault seems like it could be related to your detector changes.
 
 Niklas Nielsen wrote:
 Hey Robert, do you need some help on this patch?

Ping :) Let's get some cadence behind this, or drop it (and reopen when we can 
work on this again).


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review80655
---


On April 14, 2015, 10:23 p.m., Robert Lacroix wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33208/
 ---
 
 (Updated April 14, 2015, 10:23 p.m.)
 
 
 Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
 Vinod Kone.
 
 
 Bugs: MESOS-1634
 https://issues.apache.org/jira/browse/MESOS-1634
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When the Mesos Scheduler Driver stops, it does not delete the detector 
 process before the object is garbage collected, which leaves ZooKeeper 
 connections hanging around unnecessarily. This deletes the process on stop as 
 well, not only on destruction.
 
 
 Diffs
 -
 
   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
 
 Diff: https://reviews.apache.org/r/33208/diff/
 
 
 Testing
 ---
 
 make check, manual testing
 
 
 Thanks,
 
 Robert Lacroix