And a new Curator Tech Note to match: https://cwiki.apache.org/confluence/display/CURATOR/TN10
-JZ

On July 16, 2015 at 12:54:29 PM, Ivan Kelly ([email protected]) wrote:

I've seen 40s+. Also, if combined with a network partition, the gc pause
only needs 1/3 of the session timeout for the same effect to occur.

On Thu, 16 Jul 2015 15:58 Camille Fournier <[email protected]> wrote:

> They can and have happened in prod to people. I started talking about it
> after hearing enough people complain about just this situation on twitter.
> If you are relying on very large jvm memory footprints, a 30s gc pause can
> and should be expected. In general I think most people don't need to worry
> about this most of the time, but it's one of those things that happens, and
> the developers are almost always shocked. I'm a fan of being clear about
> edge cases, even rare ones, so that devs can make the right tradeoffs for
> their env.
>
> Of course there are myriad theoretical possibilities. But I don’t
> believe any of what you’ve mentioned will happen in production. For any
> reasonable case, you can be guaranteed that no two processes will consider
> themselves lock holders at the same instant in time.
>
> -Jordan
>
> On July 16, 2015 at 7:58:06 AM, Ivan Kelly ([email protected]) wrote:
>
> On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman <[email protected]> wrote:
>
> > Are you really seeing 30s gc pauses in production? If so, then of course
> > this could happen. However, if your application can tolerate a 30s pause
> > (which is hard to believe) then your session timeout is too low. The
> > point of the session timeout is to have enough coverage. So, if your app
> > has 30 seconds of allowable pauses, your session timeout would have to be
> > much longer.
>
> GC is just an example. There are other ways the same scenario could
> happen. The machine could swap out the process due to load. Someone could
> do something stupid in the zookeeper event thread and the session expired
> event could be delayed. The state update could have hit the ip stack
> during a network partition, and the process then got wedged. The state
> update packet could have hit the network and been routed via the moon.
> The clock could break.
>
> If you are relying on a timer on the zk client to maintain a guarantee,
> then you really aren't giving any guarantee, because the zk client doesn't
> have control over all the things that could go wrong.
>
> -Ivan
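The overlap Ivan describes falls out of simple arithmetic. Below is a minimal sketch (plain Python; the 10s session timeout and the pause endpoints are invented numbers for illustration, not from the thread) of a lock holder that pauses for longer than its session timeout — the ensemble expires the session and releases the ephemeral lock node while the holder is still stopped, so a second client can acquire the lock before the first one wakes up and learns anything:

```python
# All times in seconds; the specific values are assumptions for this sketch.
SESSION_TIMEOUT = 10.0

pause_start = 5.0   # holder A stops heartbeating (GC, swap, wedged process...)
pause_end = 45.0    # a 40s pause, as reported in the thread

# The ensemble expires A's session once it has heard nothing for
# SESSION_TIMEOUT, which deletes A's ephemeral lock node.
session_expired_at = pause_start + SESSION_TIMEOUT

# Holder B can acquire the lock any time after that; assume immediately.
b_acquires_at = session_expired_at

# A only learns its session expired after it resumes and its client
# thread runs again -- until then it still believes it holds the lock.
a_learns_at = pause_end

overlap = a_learns_at - b_acquires_at
print(f"A and B both believe they hold the lock for {overlap:.0f}s")
```

Note that no client-side timer changes this: the window exists precisely because A executes nothing while paused, which is Ivan's point about the zk client not controlling everything that can go wrong.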
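A mitigation often paired with this observation (not proposed in the thread itself) is a fencing token: every lock acquisition carries a monotonically increasing number — in ZooKeeper this could plausibly be the lock znode's zxid or sequence number — and the protected resource rejects requests bearing a token older than one it has already seen. The sketch below is a toy in-process model of that idea; the `Resource` class and token values are invented for illustration:

```python
class Resource:
    """A protected resource that refuses writes carrying a stale fencing
    token. In practice the token would come from the lock service itself,
    e.g. a znode sequence number (an assumption of this sketch)."""

    def __init__(self):
        self.highest_token = -1
        self.data = None

    def write(self, token, value):
        if token < self.highest_token:
            # A writer that was paused past its session timeout shows up
            # here with an old token and is rejected.
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data = value


resource = Resource()
resource.write(token=33, value="from A")  # A writes while holding the lock
resource.write(token=34, value="from B")  # B acquired after A's session expired
try:
    resource.write(token=33, value="late write from A")  # A wakes from its pause
except PermissionError:
    print("stale writer fenced off")
```

The guarantee now lives in the resource rather than in any client-side timer, which is exactly the property the thread says a zk client alone cannot provide.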
