Does anyone have any thoughts on this? Either, -I'm misinterpreting something, and we just need to increase the wait time / decrease the session timeout on the CURATOR-335 test. or -The way that the new connection state manager works is incorrect and needs to be fixed.
I'm leaning to the second one. My expectation would be that a LOST event would be generated after session timeout MS had passed with no connection to ZK. Not 4/3 of the session timeout MS. On Tue, Jun 7, 2016 at 5:13 PM, Cameron McKenzie <[email protected]> wrote: > I think that the problem is that the processEvents() loop in the > ConnectionStateManager class checks for timeouts while in a suspended state > only every 2/3 of the negotiated session length. > > I think that the test in CURATOR-335 is perhaps a bit strange in that it > uses the same session timeout as the amount of time it will wait for the > session to timeout to occur (+ a sleepForABitCall()), so there's not much > room for error. That's why it has exposed this issue. > > The session timeout is set to 50 seconds. > > The initial SUSPEND event occurs. > Then 50 * (2/3) seconds later the processEvents() loop wakes up again > after no events have occurred, so checks timeouts. It still hasn't timed > out, so it polls again for another 50 * (2/3) seconds. There are no > additional events in this time, which means that the next timeout check > doesn't occur until 66 seconds (100 * (2/3)) instead of after 50 seconds as > you would expect. > > If I set the assertions to wait for up to 70 seconds for the appropriate > state to be achieved, then the test passes fine. > > The poll call in the processEvents loop must take into account how much > time has already been spent in a suspended state. > > Jordan, do you remember what the rationale behind only waiting for 2/3 of > the session timeout is? It's something to do with the way that ZK itself > handles session timeouts isn't it? Does ZK timeout the session if it hasn't > received a heartbeat for 2/3 of the session timeout? I can't remember. > cheers > > > > > On Tue, Jun 7, 2016 at 10:41 AM, Cameron McKenzie <[email protected]> > wrote: > >> Seems like I have uncovered another problem on the 3.0 branch. >> >> It looks like the new (ish) connection handling stuff doesn't seem to be >> working correctly for long session timeouts. Specifically, the test for >> CURATOR-335 fails on the 3.0 branch when run with the new connection >> handling, but works with the 'classic' connection handling. >> >> It fails when asserting that the LOST event occurs after the server is >> stopped. >> >> I'm not going to have time to do much more digging for at least today, >> but I have made a more targeted test case: >> >> TestFramework:testSessionLossWithLongTimeout on >> the long_session_timeout_issue branch. >> >> if anyone has time to look before I do. >> >> I think that this needs to be resolved before 3.0 can be released. >> cheers >> >> >> On Mon, Jun 6, 2016 at 9:49 AM, Jordan Zimmerman < >> [email protected]> wrote: >> >>> :D >>> >>> > Is it worth holding up the build to merge CURATOR-331? >>> No, let’s go with what we have. >>> >>> > On Jun 5, 2016, at 6:48 PM, Cameron McKenzie <[email protected]> >>> wrote: >>> > >>> > Ah, must still be recovering, I'm sure I saw it was being applied to >>> the >>> > 3.0 branch. >>> > >>> > I will merge it into master and 3.0. >>> > >>> > Is it worth holding up the build to merge CURATOR-331? I have asked >>> Scott >>> > what his opinion is since its the TreeCache stuff. It looks ok to me >>> though. >>> > cheers >>> > >>> > On Mon, Jun 6, 2016 at 9:44 AM, Jordan Zimmerman < >>> [email protected] >>> >> wrote: >>> > >>> >> Yes, that’s correct. It’s a patch against master. I’ll do the merge if >>> >> you’re OK with it. >>> >> >>> >> -Jordan >>> >> >>> >>> On Jun 5, 2016, at 6:42 PM, Cameron McKenzie <[email protected] >>> > >>> >> wrote: >>> >>> >>> >>> hey Jordan, >>> >>> The fix for CURATOR-335 looks good to me, but I'm wondering if it >>> should >>> >>> actually be applied against master and then merged into 3.0? >>> >>> >>> >>> On Fri, Jun 3, 2016 at 12:21 PM, Jordan Zimmerman < >>> >>> [email protected]> wrote: >>> >>> >>> >>>> no worries - get well. >>> >>>> >>> >>>>> On Jun 2, 2016, at 9:20 PM, Cameron McKenzie < >>> [email protected]> >>> >>>> wrote: >>> >>>>> >>> >>>>> Thanks for sorting this out Jordan. I'm pretty sick today so won't >>> get >>> >>>>> around to looking at it, but I will try over the weekend or really >>> next >>> >>>> week >>> >>>>> On 3 Jun 2016 7:05 AM, "Jordan Zimmerman" < >>> [email protected]> >>> >>>>> wrote: >>> >>>>> >>> >>>>>>> It sounds like curator is using a different algorithm since it >>> has >>> >>>>>>> nodes sorting their position to determine if they have a lease or >>> >> not. >>> >>>>>> >>> >>>>>> No - I just added that as I thought there was a bug. But, now I >>> >> realize >>> >>>>>> I’m wrong. So, it was correct all along. Thanks Ben. >>> >>>>>> >>> >>>>>> -Jordan >>> >>>> >>> >>>> >>> >> >>> >> >>> >>> >> >
