Sorry - I’ve been tied up. I’ll take a look tomorrow. -Jordan
> On Jun 8, 2016, at 5:33 PM, Cameron McKenzie <[email protected]> wrote: > > Does anyone have any thoughts on this? > > Either, > -I'm misinterpreting something, and we just need to increase the wait time > / decrease the session timeout on the CURATOR-335 test. > or > -The way that the new connection state manager works is incorrect and needs > to be fixed. > > I'm leaning to the second one. My expectation would be that a LOST event > would be generated after session timeout MS had passed with no connection > to ZK. Not 4/3 of the session timeout MS. > > > > On Tue, Jun 7, 2016 at 5:13 PM, Cameron McKenzie <[email protected]> > wrote: > >> I think that the problem is that the processEvents() loop in the >> ConnectionStateManager class checks for timeouts while in a suspended state >> only every 2/3 of the negotiated session length. >> >> I think that the test in CURATOR-335 is perhaps a bit strange in that it >> uses the same session timeout as the amount of time it will wait for the >> session to timeout to occur (+ a sleepForABitCall()), so there's not much >> room for error. That's why it has exposed this issue. >> >> The session timeout is set to 50 seconds. >> >> The initial SUSPEND event occurs. >> Then 50 * (2/3) seconds later the processEvents() loop wakes up again >> after no events have occurred, so checks timeouts. It still hasn't timed >> out, so it polls again for another 50 * (2/3) seconds. There are no >> additional events in this time, which means that the next timeout check >> doesn't occur until 66 seconds (100 * (2/3)) instead of after 50 seconds as >> you would expect. >> >> If I set the assertions to wait for up to 70 seconds for the appropriate >> state to be achieved, then the test passes fine. >> >> The poll call in the processEvents loop must take into account how much >> time has already been spent in a suspended state. >> >> Jordan, do you remember what the rationale behind only waiting for 2/3 of >> the session timeout is? It's something to do with the way that ZK itself >> handles session timeouts isn't it? Does ZK timeout the session if it hasn't >> received a heartbeat for 2/3 of the session timeout? I can't remember. >> cheers >> >> >> >> >> On Tue, Jun 7, 2016 at 10:41 AM, Cameron McKenzie <[email protected]> >> wrote: >> >>> Seems like I have uncovered another problem on the 3.0 branch. >>> >>> It looks like the new (ish) connection handling stuff doesn't seem to be >>> working correctly for long session timeouts. Specifically, the test for >>> CURATOR-335 fails on the 3.0 branch when run with the new connection >>> handling, but works with the 'classic' connection handling. >>> >>> It fails when asserting that the LOST event occurs after the server is >>> stopped. >>> >>> I'm not going to have time to do much more digging for at least today, >>> but I have made a more targeted test case: >>> >>> TestFramework:testSessionLossWithLongTimeout on >>> the long_session_timeout_issue branch. >>> >>> if anyone has time to look before I do. >>> >>> I think that this needs to be resolved before 3.0 can be released. >>> cheers >>> >>> >>> On Mon, Jun 6, 2016 at 9:49 AM, Jordan Zimmerman < >>> [email protected]> wrote: >>> >>>> :D >>>> >>>>> Is it worth holding up the build to merge CURATOR-331? >>>> No, let’s go with what we have. >>>> >>>>> On Jun 5, 2016, at 6:48 PM, Cameron McKenzie <[email protected]> >>>> wrote: >>>>> >>>>> Ah, must still be recovering, I'm sure I saw it was being applied to >>>> the >>>>> 3.0 branch. >>>>> >>>>> I will merge it into master and 3.0. >>>>> >>>>> Is it worth holding up the build to merge CURATOR-331? I have asked >>>> Scott >>>>> what his opinion is since its the TreeCache stuff. It looks ok to me >>>> though. >>>>> cheers >>>>> >>>>> On Mon, Jun 6, 2016 at 9:44 AM, Jordan Zimmerman < >>>> [email protected] >>>>>> wrote: >>>>> >>>>>> Yes, that’s correct. It’s a patch against master. I’ll do the merge if >>>>>> you’re OK with it. >>>>>> >>>>>> -Jordan >>>>>> >>>>>>> On Jun 5, 2016, at 6:42 PM, Cameron McKenzie <[email protected] >>>>> >>>>>> wrote: >>>>>>> >>>>>>> hey Jordan, >>>>>>> The fix for CURATOR-335 looks good to me, but I'm wondering if it >>>> should >>>>>>> actually be applied against master and then merged into 3.0? >>>>>>> >>>>>>> On Fri, Jun 3, 2016 at 12:21 PM, Jordan Zimmerman < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> no worries - get well. >>>>>>>> >>>>>>>>> On Jun 2, 2016, at 9:20 PM, Cameron McKenzie < >>>> [email protected]> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Thanks for sorting this out Jordan. I'm pretty sick today so won't >>>> get >>>>>>>>> around to looking at it, but I will try over the weekend or really >>>> next >>>>>>>> week >>>>>>>>> On 3 Jun 2016 7:05 AM, "Jordan Zimmerman" < >>>> [email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>>> It sounds like curator is using a different algorithm since it >>>> has >>>>>>>>>>> nodes sorting their position to determine if they have a lease or >>>>>> not. >>>>>>>>>> >>>>>>>>>> No - I just added that as I thought there was a bug. But, now I >>>>>> realize >>>>>>>>>> I’m wrong. So, it was correct all along. Thanks Ben. >>>>>>>>>> >>>>>>>>>> -Jordan >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>> >>>> >>> >>
