Re: Zookeeper locks issue

Karl Wright Thu, 09 Dec 2021 04:28:04 -0800

The large number of connections can happen but usually that means something
is stuck somewhere and there is a "train wreck" of other locks getting
backed up.


If this is completely repeatable, then I think we have an opportunity to
figure out why it is happening.  One thing that is clear is that it
doesn't happen in other situations or in our integration tests, so I have
to ask: what might you be doing differently here?

I was operating on the assumption that the session just expires from lack
of use, but in this case it may well be the other way around: something
hangs elsewhere and a lock is held open for a very long time, long enough
to exceed the timeout.  With dozens of jobs running it may be hard to trace
the cause, but if you can winnow it down to a small number of jobs, the
logs may give us a good picture of what is happening.
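
To make the second scenario concrete, here is a minimal toy model of it. It
is not ZooKeeper's API and not the ManifoldCF code: ToySession, its methods
and the timings are all made up for illustration, and it assumes that
whatever hangs also stops the client from contacting the server for longer
than the session timeout.

```java
import java.util.HashSet;
import java.util.Set;

class SessionExpirySketch {
    /** Hypothetical stand-in for a session that owns ephemeral lock nodes. */
    static final class ToySession {
        private final long timeoutMs;
        private long lastContactMs;
        private final Set<String> ephemeralNodes = new HashSet<>();

        ToySession(long timeoutMs, long nowMs) {
            this.timeoutMs = timeoutMs;
            this.lastContactMs = nowMs;
        }

        // Assumption: if the session was silent for longer than the timeout,
        // it has expired and every ephemeral node it owned is gone.
        private void touch(long nowMs) {
            if (nowMs - lastContactMs > timeoutMs) {
                ephemeralNodes.clear();
            }
            lastContactMs = nowMs;
        }

        void obtainLock(String node, long nowMs) {
            touch(nowMs);
            ephemeralNodes.add(node);
        }

        void releaseLock(String node, long nowMs) {
            touch(nowMs);
            if (!ephemeralNodes.remove(node)) {
                throw new IllegalStateException("Can't release lock we don't hold");
            }
        }
    }

    public static void main(String[] args) {
        ToySession session = new ToySession(30_000, 0);
        session.obtainLock("job-1", 0);
        session.releaseLock("job-1", 10_000);      // 10 s of silence: still held
        session.obtainLock("job-2", 10_000);
        try {
            session.releaseLock("job-2", 100_000); // 90 s of silence: node is gone
        } catch (IllegalStateException e) {
            System.out.println("release failed: " + e.getMessage());
        }
    }
}
```

If the logs show a long gap between obtaining a lock and the failed release,
that would point toward this scenario rather than an idle session.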

Karl




On Wed, Dec 8, 2021 at 3:55 PM Julien Massiera <
[email protected]> wrote:

> Hi,
>
> after having increased the session lifetime by 3, the lock error still
> happens and the MCF agent hangs, so all my jobs also hang.
>
> Also, as I said in the other thread today, I notice a very large number
> of simultaneous connections from the agent to Zookeeper (more than 1000)
> and I cannot tell if it is normal or not.
>
> Can we ignore that particular error and avoid blocking an entire MCF node?
>
> Julien
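
One thing worth checking is whether the larger session lifetime actually
took effect. As far as I know, the ZooKeeper server clamps the timeout a
client requests to the range [minSessionTimeout, maxSessionTimeout], and
those default to 2 and 20 times tickTime. A small sketch of that
arithmetic follows; the method name is made up, "-1 means not configured"
is my own convention, and the tickTime of 2000 ms is an assumption about
your setup.

```java
class SessionTimeoutSketch {
    // Hypothetical helper: the timeout the server would grant for a request.
    // Pass -1 for a bound that is not set in the server configuration.
    static int negotiatedTimeoutMs(int requestedMs, int tickTimeMs,
                                   int minSessionTimeoutMs, int maxSessionTimeoutMs) {
        int min = minSessionTimeoutMs >= 0 ? minSessionTimeoutMs : 2 * tickTimeMs;
        int max = maxSessionTimeoutMs >= 0 ? maxSessionTimeoutMs : 20 * tickTimeMs;
        // Clamp the request into [min, max].
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // Assumed values: tickTime 2000 ms, no explicit bounds, 120 s requested.
        System.out.println(negotiatedTimeoutMs(120_000, 2_000, -1, -1));
        // Same request with maxSessionTimeout raised to 180 s on the server.
        System.out.println(negotiatedTimeoutMs(120_000, 2_000, -1, 180_000));
    }
}
```

With those assumed values the first call yields 40000, so a request above
40 s would be cut back unless maxSessionTimeout is raised on the server
side as well. The client's getSessionTimeout() should report the value
that was actually negotiated.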
>
> On 7 Dec 2021 at 22:15, Julien Massiera wrote:
> > Ok that makes sense. But still, I don't understand how the "Can't
> > release lock we don't hold" exception can happen, knowing for sure
> > that neither the Zookeeper process nor the MCF agent process has been
> > down and/or restarted. I am not sure that increasing the session
> > lifetime would solve that particular issue, and since I have no use
> > case to easily reproduce it, it is very complicated to debug.
> >
> > Julien
> >
> > On 7 Dec 2021 at 19:08, Karl Wright wrote:
> >> What this code is doing is interpreting exceptions returned by
> >> Zookeeper.  There are some kinds of exceptions it interprets as
> >> "session has expired", so it rebuilds the session.
> >>
> >> The code is written in such a way that the locks are presumed to persist
> >> beyond the session.  In fact, if they do not persist beyond the session,
> >> there is a risk that proper locks won't be enforced.
> >>
> >> If I recall correctly, we have a number of integration tests that
> >> exercise Zookeeper integration and are meant to allow sessions to
> >> expire and be re-established.  If what you say is true and information
> >> is attached solely to a session, Zookeeper cannot possibly work as the
> >> cross-process lock mechanism we use it for.  And yet it is used not
> >> just by us in this way, but by many other projects as well.
> >>
> >> So I think that the diagnosis that nodes in Zookeeper have session
> >> affinity is not entirely correct.  It may be the case that only one
> >> session *owns* a node, and if that session expires then the node goes
> >> away.  In that case I think the right approach is to modify the
> >> Zookeeper parameters to increase the session lifetime; I don't see any
> >> other way to prevent bad things from happening.  Presumably, if a
> >> session is created within a process, and the process dies, the session
> >> does too.
> >>
> >> Karl
> >>
> >>
> >> On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera <
> >> [email protected]> wrote:
> >>
> >>> Karl,
> >>>
> >>> I tried to understand the Zookeeper lock logic in the code, and the
> >>> only thing I don't understand is the
> >>> 'handleEphemeralNodeKeeperException' method that is called in the
> >>> catch(KeeperException e) of every obtain/release lock method of the
> >>> ZooKeeperConnection class.
> >>>
> >>> This method sets the lockNode param to 'null', recreates a session and
> >>> recreates nodes, but does not reset the lockNode param at the end. So,
> >>> as I understand it, when this happens it may result in the lock release
> >>> error that I mentioned, because that error is triggered when the
> >>> lockNode param is 'null'.
> >>>
> >>> The method is in the class
> >>> org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If you can
> >>> take a look and tell me what you think about it, it would be great!
> >>>
> >>> Thanks,
> >>>
> >>> Julien
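
The sequence described above can be sketched as follows. This is only a toy
model of the description, not the ZooKeeperConnection source: ToyConnection,
handleSessionLoss and releaseLockIfHeld are names invented for illustration.
The last method shows the "catch it and ignore it" idea from earlier in the
thread.

```java
class LockNodeSketch {
    /** Hypothetical stand-in for a connection that tracks one lock node. */
    static final class ToyConnection {
        private String lockNode;      // null means "no lock held"
        private int sessionId = 1;

        void obtainLock(String path) {
            lockNode = path + "/lock-" + sessionId;
        }

        // Models the handler as described: forget the node, start a new
        // session, and never restore lockNode afterwards.
        void handleSessionLoss() {
            lockNode = null;
            sessionId++;
        }

        // Strict release: fails if the handler ran in between.
        void releaseLock() {
            if (lockNode == null) {
                throw new IllegalStateException("Can't release lock we don't hold");
            }
            lockNode = null;
        }

        // Tolerant release: a missing node counts as already released.
        // Returns true only if a lock was actually held.
        boolean releaseLockIfHeld() {
            if (lockNode == null) {
                return false;
            }
            lockNode = null;
            return true;
        }
    }

    public static void main(String[] args) {
        ToyConnection connection = new ToyConnection();
        connection.obtainLock("/locks/job");
        connection.handleSessionLoss();
        System.out.println("tolerant release held a lock: " + connection.releaseLockIfHeld());
        try {
            connection.releaseLock();
        } catch (IllegalStateException e) {
            System.out.println("strict release failed: " + e.getMessage());
        }
    }
}
```

The tolerant variant is only safe if the node really did disappear with the
old session; otherwise another process could be left waiting on a lock that
nobody will ever release.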
> >>>
> >>> On 7 Dec 2021 at 14:40, Julien Massiera wrote:
> >>>> Yes, I will try the patch then and see if it works.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Julien
> >>>>
> >>>> On 7 Dec 2021 at 14:28, Karl Wright wrote:
> >>>>> Yes, this is plausible.  But I'm not sure what the solution is.  If a
> >>>>> zookeeper session disappears, according to the documentation
> >>>>> everything associated with that session should also disappear.
> >>>>>
> >>>>> So I guess we could catch this error and just ignore it, assuming
> >>>>> that the session must be gone anyway?
> >>>>>
> >>>>> Karl
> >>>>>
> >>>>>
> >>>>> On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> the Zookeeper lock error mentioned in the second-to-last comment of
> >>>>>> this issue, https://issues.apache.org/jira/browse/CONNECTORS-1447:
> >>>>>>
> >>>>>> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
> >>>>>> java.lang.IllegalStateException: Can't release lock we don't hold
> >>>>>>     at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> >>>>>>     at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> >>>>>>     at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> >>>>>>     at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> >>>>>>     at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> >>>>>>     at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> >>>>>>     at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> >>>>>>     at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> >>>>>>     at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> >>>>>>     at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> >>>>>>     at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> >>>>>>     at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
> >>>>>>
> >>>>>> is still happening in 2021 with the 2.20 version of MCF.
> >>>>>>
> >>>>>> Karl, you hypothesized that it could be related to Zookeeper being
> >>>>>> restarted while the MCF agent is still running, but after some
> >>>>>> investigation, my theory is that it is related to re-established
> >>>>>> sessions. Locks are not associated with a process but with a
> >>>>>> session, and it could happen that when a session is closed
> >>>>>> accidentally (interrupted by exceptions etc.), it does not correctly
> >>>>>> release the locks it set. When a new session is created by Zookeeper
> >>>>>> for the same client, the locks cannot be released because they
> >>>>>> belong to an old session, and the exception is thrown!
> >>>>>>
> >>>>>> Does this seem plausible to you? I have no knowledge of Zookeeper,
> >>>>>> but if it is plausible, then it is worth investigating the code to
> >>>>>> check that all locks are released when a session is
> >>>>>> closed/interrupted by a problem.
> >>>>>>
> >>>>>> Julien
> >>>>>>
> >>>>>> --
> >>>>>> The absence of viruses in this email has been verified by the
> >>>>>> Avast antivirus software.
> >>>>>> https://www.avast.com/antivirus
> >>>>>>
> >>>
> >>>
>
