Ben, could you explain a bit more why you think this won't work? I'm trying to decide if I should put in the work to take the POC I wrote and complete it, but I don't really want to waste my time if there's a fundamental reason it's a bad idea.

Thanks,
Camille

-----Original Message-----
From: Benjamin Reed [mailto:br...@yahoo-inc.com]
Sent: Wednesday, September 08, 2010 4:03 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

unfortunately, that only works on the standalone server.

ben

On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:
> This would be the ideal solution to this problem I think.
> Poking around the (3.3) code to figure out how hard it would be to implement,
> I figure one way to do it would be to modify the session timeout to the min
> session timeout and touch the connection before calling close when you get
> certain exceptions in NIOServerCnxn.doIO. I did this (removing the code in
> touch session that returns if the tickTime is greater than the expire time)
> and it worked (in the standalone server anyway). Interesting solution, or
> total hack that will not work beyond most basic test case?
>
> C
>
> (forgive lack of actual code in this email)
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: Tuesday, September 07, 2010 1:11 PM
> To: zookeeper-user@hadoop.apache.org
> Cc: Benjamin Reed
> Subject: Re: closing session on socket close vs waiting for timeout
>
> This really is, just as Ben says, a problem of false positives and false
> negatives in detecting session expiration.
>
> On the other hand, the current algorithm isn't really using all the
> information available. The current algorithm is using time since last
> client initiated heartbeat. The new proposal is somewhat worse in that it
> proposes to use just the boolean "has-TCP-disconnect-happened".
>
> Perhaps it would be better to use multiple features in order to decrease
> both false positives and false negatives.
>
> For instance, I could imagine that we use the following features:
>
> - time since last client heartbeat or disconnect or reconnect
>
> - what was the last event? (a heartbeat or a disconnect or a reconnect)
>
> Then the expiration algorithm could use a relatively long time since last
> heartbeat and a relatively short time since last disconnect to mark a
> session as disconnected.
>
> Wouldn't this avoid expiration during GC and cluster partition and cause
> expiration quickly after a client disconnect?
>
> On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt <ph...@apache.org> wrote:
>
>> That's a good point, however with suitable documentation, warnings and such
>> it seems like a reasonable feature to provide for those users who require
>> it. Used in moderation it seems fine to me. Perhaps we also make it
>> configurable at the server level for those administrators/ops who don't
>> want to deal with it (disable the feature entirely, or only enable on
>> particular servers, etc...).
>>
>> Patrick
>>
>> On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed <br...@yahoo-inc.com> wrote:
>>
>>> if this mechanism were used very often, we would get a huge number of
>>> session expirations when a server fails. you are trading fast error
>>> detection for the ability to tolerate temporary network and server
>>> outages.
>>>
>>> to be honest this seems like something that in theory sounds like it will
>>> work in practice, but once deployed we start getting session expirations
>>> for cases that we really do not want or expect.
>>>
>>> ben
>>>
>>> On 09/01/2010 12:47 PM, Patrick Hunt wrote:
>>>
>>>> Ben, in this case the session would be tied directly to the connection,
>>>> we'd explicitly deny session re-establishment for this session type (so
>>>> 4 would fail). Would that address your concern, others?
>>>>
>>>> Patrick
>>>>
>>>> On 09/01/2010 10:03 AM, Benjamin Reed wrote:
>>>>
>>>>> i'm a bit skeptical that this is going to work out properly. a server
>>>>> may receive a socket reset even though the client is still alive:
>>>>>
>>>>> 1) client sends a request to a server
>>>>> 2) client is partitioned from the server
>>>>> 3) server starts trying to send response
>>>>> 4) client reconnects to a different server
>>>>> 5) partition heals
>>>>> 6) server gets a reset from client
>>>>>
>>>>> at step 6 i don't think you want to delete the ephemeral nodes.
>>>>>
>>>>> ben
>>>>>
>>>>> On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:
>>>>>
>>>>>> Yes that's right. Which network issues can cause the socket to close
>>>>>> without the initiating process closing the socket? In my limited
>>>>>> experience in this area network issues were more prone to leave dead
>>>>>> sockets open rather than vice versa so I don't know what to look out
>>>>>> for.
>>>>>>
>>>>>> Thanks,
>>>>>> Camille
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dave Wright [mailto:wrig...@gmail.com]
>>>>>> Sent: Tuesday, August 31, 2010 1:14 PM
>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>> Subject: Re: closing session on socket close vs waiting for timeout
>>>>>>
>>>>>> I think he's saying that if the socket closes because of a crash (i.e.
>>>>>> not a normal zookeeper close request) then the session stays alive
>>>>>> until the session timeout, which is of course true since ZK allows
>>>>>> reconnection and resumption of the session in case of disconnect due
>>>>>> to network issues.
>>>>>>
>>>>>> -Dave Wright
>>>>>>
>>>>>> On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>>>>
>>>>>>> That doesn't sound right to me.
>>>>>>>
>>>>>>> Is there a Zookeeper expert in the house?
>>>>>>>
>>>>>>> On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech] <camille.fourn...@gs.com> wrote:
>>>>>>>
>>>>>>>> I foolishly did not investigate the ZK code closely enough and it
>>>>>>>> seems that closing the socket still waits for the session timeout to
>>>>>>>> remove the session.
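
Ted's two-feature suggestion in the thread (a relatively long timeout since the last heartbeat, a relatively short one since the last disconnect) could be sketched roughly as follows. This is an illustrative Python sketch only, not ZooKeeper code: the names `LastEvent`, `SessionState`, `should_expire` and both threshold values are assumptions invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class LastEvent(Enum):
    """The most recent thing the server observed for a session (hypothetical)."""
    HEARTBEAT = "heartbeat"
    DISCONNECT = "disconnect"
    RECONNECT = "reconnect"


@dataclass
class SessionState:
    last_event: LastEvent
    last_event_time: float  # seconds on any monotonic clock


# Hypothetical thresholds, chosen only to make the example concrete.
HEARTBEAT_TIMEOUT = 30.0  # "relatively long": rides out GC pauses and partitions
DISCONNECT_TIMEOUT = 2.0  # "relatively short": expire soon after a socket close


def should_expire(state: SessionState, now: float) -> bool:
    """Apply the two-threshold rule: the limit depends on the last event seen."""
    elapsed = now - state.last_event_time
    if state.last_event is LastEvent.DISCONNECT:
        return elapsed > DISCONNECT_TIMEOUT
    # A heartbeat or a reconnect both show the client was alive at that moment,
    # so the longer limit applies.
    return elapsed > HEARTBEAT_TIMEOUT
```

The sketch covers a single server's view of one session. It does not by itself handle the stale reset in Ben's step 6, where the old server sees the disconnect only after the client has already reconnected to a different server.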