Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-05-12 Thread Niklas Nielsen


> On April 19, 2015, 10:55 p.m., Adam B wrote:
> > LGTM, barring a question about ordering/synchronization. I'll let another 
> > committer take a look before we commit it.
> 
> Adam B wrote:
> Would also like to see a successful ReviewBot pass. That MasterFailover 
> segfault seems like it could be related to your detector changes.

Hey Robert, do you need some help on this patch?


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review80655
---


On April 14, 2015, 10:23 p.m., Robert Lacroix wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33208/
> ---
> 
> (Updated April 14, 2015, 10:23 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-1634
> https://issues.apache.org/jira/browse/MESOS-1634
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the Mesos Scheduler Driver stops, it does not delete the detector 
> process before the object is garbage collected, which leaves ZooKeeper 
> connections hanging around unnecessarily. This deletes the process on stop as 
> well, not only on destruction.
> 
> 
> Diffs
> -
> 
>   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
> 
> Diff: https://reviews.apache.org/r/33208/diff/
> 
> 
> Testing
> ---
> 
> make check, manual testing
> 
> 
> Thanks,
> 
> Robert Lacroix
> 
>



Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-06-01 Thread Niklas Nielsen


> On April 19, 2015, 10:55 p.m., Adam B wrote:
> > LGTM, barring a question about ordering/synchronization. I'll let another 
> > committer take a look before we commit it.
> 
> Adam B wrote:
> Would also like to see a successful ReviewBot pass. That MasterFailover 
> segfault seems like it could be related to your detector changes.
> 
> Niklas Nielsen wrote:
> Hey Robert, do you need some help on this patch?

Ping :) Let's get some cadence behind this, or drop it (and reopen when we can 
work on this again).


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review80655
---


On April 14, 2015, 10:23 p.m., Robert Lacroix wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33208/
> ---
> 
> (Updated April 14, 2015, 10:23 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-1634
> https://issues.apache.org/jira/browse/MESOS-1634
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the Mesos Scheduler Driver stops, it does not delete the detector 
> process before the object is garbage collected, which leaves ZooKeeper 
> connections hanging around unnecessarily. This deletes the process on stop as 
> well, not only on destruction.
> 
> 
> Diffs
> -
> 
>   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
> 
> Diff: https://reviews.apache.org/r/33208/diff/
> 
> 
> Testing
> ---
> 
> make check, manual testing
> 
> 
> Thanks,
> 
> Robert Lacroix
> 
>



Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-06-11 Thread Niklas Nielsen


> On April 19, 2015, 10:55 p.m., Adam B wrote:
> > LGTM, barring a question about ordering/synchronization. I'll let another 
> > committer take a look before we commit it.
> 
> Adam B wrote:
> Would also like to see a successful ReviewBot pass. That MasterFailover 
> segfault seems like it could be related to your detector changes.
> 
> Niklas Nielsen wrote:
> Hey Robert, do you need some help on this patch?
> 
> Niklas Nielsen wrote:
> Ping :) Let's get some cadence behind this, or drop it (and reopen when 
> we can work on this again).

Alright - I think I understand the problem. In the test cases, the owned 
pointer will get double deleted.

Let's iterate on how to address this.


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review80655
---


On April 14, 2015, 10:23 p.m., Robert Lacroix wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33208/
> ---
> 
> (Updated April 14, 2015, 10:23 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-1634
> https://issues.apache.org/jira/browse/MESOS-1634
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the Mesos Scheduler Driver stops, it does not delete the detector 
> process before the object is garbage collected, which leaves ZooKeeper 
> connections hanging around unnecessarily. This deletes the process on stop as 
> well, not only on destruction.
> 
> 
> Diffs
> -
> 
>   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
> 
> Diff: https://reviews.apache.org/r/33208/diff/
> 
> 
> Testing
> ---
> 
> make check, manual testing
> 
> 
> Thanks,
> 
> Robert Lacroix
> 
>



Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-06-12 Thread Benjamin Hindman


> On April 20, 2015, 5:55 a.m., Adam B wrote:
> > LGTM, barring a question about ordering/synchronization. I'll let another 
> > committer take a look before we commit it.
> 
> Adam B wrote:
> Would also like to see a successful ReviewBot pass. That MasterFailover 
> segfault seems like it could be related to your detector changes.
> 
> Niklas Nielsen wrote:
> Hey Robert, do you need some help on this patch?
> 
> Niklas Nielsen wrote:
> Ping :) Let's get some cadence behind this, or drop it (and reopen when 
> we can work on this again).
> 
> Niklas Nielsen wrote:
> Alright - I think I understand the problem. In the test cases, the owned 
> pointer will get double deleted.
> 
> Let's iterate on how to address this.

It's unsafe to delete the detector before we stop SchedulerProcess since we're 
using the detector in 'SchedulerProcess::detected' and it's possible that the 
detector will attempt to be used after we dispatch to 'SchedulerProcess::stop' 
but before we call 'terminate' in 'SchedulerProcess::stop'!

We can't delete the detector until after we've waited for the process to 
terminate. Here's one suggestion, delete the detector in 'join', which is 
already waiting for 'SchedulerProcess::stop' to complete, so it's not a big 
deal to also wait for the entire process to terminate (assuming that 'stop' has 
actually been called). Here's an example of how this might be able to work: 
https://reviews.apache.org/r/35405

Note that I haven't actually compiled this code or tested it, it's just a proof 
of concept to hopefully unblock folks here, please let me know if I can help 
further. The general expectatoin with this could would be that from Java you'd 
do something like:

driver.stop();
driver.join(); // After this returns we know that the detector has been deleted.

And in the abort paths it would look like:

driver.abort();
driver.stop();
driver.join(); // After this returns we know that the detector has been deleted.


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review80655
---


On April 15, 2015, 5:23 a.m., Robert Lacroix wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33208/
> ---
> 
> (Updated April 15, 2015, 5:23 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-1634
> https://issues.apache.org/jira/browse/MESOS-1634
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the Mesos Scheduler Driver stops, it does not delete the detector 
> process before the object is garbage collected, which leaves ZooKeeper 
> connections hanging around unnecessarily. This deletes the process on stop as 
> well, not only on destruction.
> 
> 
> Diffs
> -
> 
>   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
> 
> Diff: https://reviews.apache.org/r/33208/diff/
> 
> 
> Testing
> ---
> 
> make check, manual testing
> 
> 
> Thanks,
> 
> Robert Lacroix
> 
>



Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-06-15 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review87958
---


Hey Robert; BenH helped out and wrote a PoC patch here 
https://reviews.apache.org/r/35405

In short; it is not safe to delete the detector at this point. The patch above 
does it in join() and has a good descriptive block of comment explaining the 
issue.
What we need to do is try it out and test it.

- Niklas Nielsen


On April 14, 2015, 10:23 p.m., Robert Lacroix wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33208/
> ---
> 
> (Updated April 14, 2015, 10:23 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-1634
> https://issues.apache.org/jira/browse/MESOS-1634
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the Mesos Scheduler Driver stops, it does not delete the detector 
> process before the object is garbage collected, which leaves ZooKeeper 
> connections hanging around unnecessarily. This deletes the process on stop as 
> well, not only on destruction.
> 
> 
> Diffs
> -
> 
>   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
> 
> Diff: https://reviews.apache.org/r/33208/diff/
> 
> 
> Testing
> ---
> 
> make check, manual testing
> 
> 
> Thanks,
> 
> Robert Lacroix
> 
>



Re: Review Request 33208: Delete detector in MesosSchedulerDriver::stop

2015-08-24 Thread Niklas Nielsen


> On June 15, 2015, 12:07 p.m., Niklas Nielsen wrote:
> > Hey Robert; BenH helped out and wrote a PoC patch here 
> > https://reviews.apache.org/r/35405
> > 
> > In short; it is not safe to delete the detector at this point. The patch 
> > above does it in join() and has a good descriptive block of comment 
> > explaining the issue.
> > What we need to do is try it out and test it.

We are working on another approach, suggested by BenH. Do you want to close 
this RR and continue in the other one?


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33208/#review87958
---


On April 14, 2015, 10:23 p.m., Robert Lacroix wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33208/
> ---
> 
> (Updated April 14, 2015, 10:23 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-1634
> https://issues.apache.org/jira/browse/MESOS-1634
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the Mesos Scheduler Driver stops, it does not delete the detector 
> process before the object is garbage collected, which leaves ZooKeeper 
> connections hanging around unnecessarily. This deletes the process on stop as 
> well, not only on destruction.
> 
> 
> Diffs
> -
> 
>   src/sched/sched.cpp 66fd2b3146568900356cc48e0f17c6720665ae80 
> 
> Diff: https://reviews.apache.org/r/33208/diff/
> 
> 
> Testing
> ---
> 
> make check, manual testing
> 
> 
> Thanks,
> 
> Robert Lacroix
> 
>