And a new Curator Tech Note to match: https://cwiki.apache.org/confluence/display/CURATOR/TN10
-JZ

On July 16, 2015 at 12:54:29 PM, Ivan Kelly ([email protected]) wrote:

I've seen 40s+. Also, if combined with a network partition, the gc pause
only needs 1/3 of the session timeout for the same effect to occur.

On Thu, 16 Jul 2015 15:58 Camille Fournier <[email protected]> wrote:

> They can and have happened in prod to people. I started talking about it
> after hearing enough people complain about just this situation on twitter.
> If you are relying on very large jvm memory footprints, a 30s gc pause can
> and should be expected. In general I think most people don't need to worry
> about this most of the time, but it's one of those things that happens, and
> the developers are almost always shocked. I'm a fan of being clear about
> edge cases, even rare ones, so that devs can make the right tradeoffs for
> their env.
>
> Of course there are myriad theoretical possibilities. But I don’t
> believe any of what you’ve mentioned will happen in production. For any
> reasonable case, you can be guaranteed that no two processes will consider
> themselves lock holders at the same instant in time.
>
> -Jordan
>
> On July 16, 2015 at 7:58:06 AM, Ivan Kelly ([email protected]) wrote:
>
> On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman <[email protected]> wrote:
>
> > Are you really seeing 30s gc pauses in production? If so, then of course
> > this could happen. However, if your application can tolerate a 30s pause
> > (which is hard to believe) then your session timeout is too low. The
> > point of the session timeout is to have enough coverage. So, if your app
> > has 30 seconds of allowable pauses, your session timeout would have to be
> > much longer.
>
> GC is just an example. There are other ways the same scenario could
> happen. The machine could swap out the process due to load. Someone could
> do something stupid in the zookeeper event thread and the session expired
> event could be delayed. The state update could have hit the ip stack
> during a network partition, and the process then got wedged. The state
> update packet could have hit the network and been routed via the moon.
> The clock could break.
>
> If you are relying on a timer on the zk client to maintain a guarantee,
> then you really aren't giving any guarantee, because the zk client doesn't
> have control over all the things that could go wrong.
>
> -Ivan
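The overlap Ivan describes falls out of simple arithmetic. Below is a minimal sketch (plain Python; the 10s session timeout and the pause endpoints are invented numbers for illustration, not from the thread) of a lock holder that pauses for longer than its session timeout — the ensemble expires the session and releases the ephemeral lock node while the holder is still stopped, so a second client can acquire the lock before the first one wakes up and learns anything:

```python
# All times in seconds; the specific values are assumptions for this sketch.
SESSION_TIMEOUT = 10.0

pause_start = 5.0   # holder A stops heartbeating (GC, swap, wedged process...)
pause_end = 45.0    # a 40s pause, as reported in the thread

# The ensemble expires A's session once it has heard nothing for
# SESSION_TIMEOUT, which deletes A's ephemeral lock node.
session_expired_at = pause_start + SESSION_TIMEOUT

# Holder B can acquire the lock any time after that; assume immediately.
b_acquires_at = session_expired_at

# A only learns its session expired after it resumes and its client
# thread runs again -- until then it still believes it holds the lock.
a_learns_at = pause_end

overlap = a_learns_at - b_acquires_at
print(f"A and B both believe they hold the lock for {overlap:.0f}s")
```

Note that no client-side timer changes this: the window exists precisely because A executes nothing while paused, which is Ivan's point about the zk client not controlling everything that can go wrong.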
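A mitigation often paired with this observation (not proposed in the thread itself) is a fencing token: every lock acquisition carries a monotonically increasing number — in ZooKeeper this could plausibly be the lock znode's zxid or sequence number — and the protected resource rejects requests bearing a token older than one it has already seen. The sketch below is a toy in-process model of that idea; the `Resource` class and token values are invented for illustration:

```python
class Resource:
    """A protected resource that refuses writes carrying a stale fencing
    token. In practice the token would come from the lock service itself,
    e.g. a znode sequence number (an assumption of this sketch)."""

    def __init__(self):
        self.highest_token = -1
        self.data = None

    def write(self, token, value):
        if token < self.highest_token:
            # A writer that was paused past its session timeout shows up
            # here with an old token and is rejected.
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data = value


resource = Resource()
resource.write(token=33, value="from A")  # A writes while holding the lock
resource.write(token=34, value="from B")  # B acquired after A's session expired
try:
    resource.write(token=33, value="late write from A")  # A wakes from its pause
except PermissionError:
    print("stale writer fenced off")
```

The guarantee now lives in the resource rather than in any client-side timer, which is exactly the property the thread says a zk client alone cannot provide.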
