They can and have happened in prod to people. I started talking about it
after hearing enough people complain about exactly this situation on Twitter.
If you are relying on very large JVM memory footprints, a 30s GC pause can
and should be expected. In general, I think most people don't need to worry
about this most of the time, but it's one of those things that happens, and
the developers are almost always shocked. I'm a fan of being clear about
edge cases, even rare ones, so that devs can make the right tradeoffs for
their environment.
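To make the timeout-coverage point concrete, here is a back-of-the-envelope sketch. The numbers and the safety factor are illustrative assumptions, not a recommendation:

```python
# Illustrative numbers only: the worst-case pause you have actually
# observed, plus headroom, drives the session timeout you configure.
max_observed_pause = 30.0  # seconds: worst GC pause seen in prod (assumed)
safety_factor = 2.0        # hypothetical headroom multiplier
session_timeout = max_observed_pause * safety_factor

# If the timeout does not cover the pause, the session can expire
# mid-pause and the lock can be granted to someone else.
assert session_timeout > max_observed_pause
print(session_timeout)
```

The only real constraint is the assertion: the session timeout must comfortably exceed the longest pause your process can actually experience.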
Of course there are myriad theoretical possibilities. But I don’t believe
any of what you’ve mentioned will happen in production. For any reasonable
case, you can be guaranteed that no two processes will consider themselves
lock holders at the same instant in time.

-Jordan


On July 16, 2015 at 7:58:06 AM, Ivan Kelly ([email protected]) wrote:

On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman <[email protected]> wrote:

> Are you really seeing 30s gc pauses in production? If so, then of course
> this could happen. However, if your application can tolerate a 30s pause
> (which is hard to believe) then your session timeout is too low. The point
> of the session timeout is to have enough coverage. So, if your app has 30
> seconds allowable pauses your session timeout would have to be much
> longer.
>
GC is just an example. There are other ways the same scenario could happen.
The machine could swap out the process due to load. Someone could do
something stupid in the ZooKeeper event thread, delaying the session-expired
event. The state update could have hit the IP stack during a network
partition, and the process then got wedged. The state update packet could
have hit the network and been routed via the moon. The clock could break.

If you are relying on a timer on the ZK client to maintain a guarantee,
then you really aren't giving any guarantee, because the ZK client doesn't
have control over all the things that could go wrong.
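The race can be shown with a toy simulation. There is no real ZooKeeper or Curator here, just a logical clock and a hypothetical lock server with all names invented for illustration. A wedged client (GC, swap, whatever) keeps believing it holds the lock because the session-expired event sits unprocessed in its event queue, while the server has already handed the lock to someone else:

```python
from collections import deque

SESSION_TIMEOUT = 10  # seconds of silence before the server expires a session


class Client:
    def __init__(self, name):
        self.name = name
        self.holds_lock = False
        self.events = deque()  # delivered by the "zk client" event thread

    def process_events(self):
        # The client only learns of expiry when its event thread runs --
        # a long pause delays exactly this.
        while self.events:
            if self.events.popleft() == "session-expired":
                self.holds_lock = False


class Server:
    def __init__(self):
        self.holder = None
        self.last_heartbeat = 0

    def try_acquire(self, client, now):
        # Expire the current holder's session if it has been silent too long.
        if self.holder and now - self.last_heartbeat >= SESSION_TIMEOUT:
            self.holder.events.append("session-expired")
            self.holder = None
        if self.holder is None:
            self.holder, self.last_heartbeat = client, now
            client.holds_lock = True
            return True
        return False


a, b, server = Client("A"), Client("B"), Server()
server.try_acquire(a, now=0)   # t=0: A holds the lock
# t=0..12: A is wedged (GC pause / swap / partition); no heartbeats sent,
# and its event thread processes nothing.
server.try_acquire(b, now=12)  # t=12: server expires A, grants the lock to B
assert a.holds_lock and b.holds_lock  # t=13: both believe they hold it
a.process_events()             # only now does A learn the truth
assert not a.holds_lock
```

At t=13 both A and B consider themselves the lock holder; A only finds out otherwise once its event thread runs again, which is exactly why a client-side timer alone cannot carry the guarantee.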

-Ivan
