Hi,

I'm not sure I understand what you mean by establishing a connection. What
kind of connection? If you can explain when the agent needs to set and
release locks, I'll be able to investigate better.

Julien 

-----Original Message-----
From: Karl Wright <daddy...@gmail.com>
Sent: Thursday, December 9, 2021 15:37
To: dev <dev@manifoldcf.apache.org>
Subject: Re: Zookeeper locks issue

The fact that you only see this on one job is pretty clear evidence that we
are seeing a hang of some kind due to something a specific connector or
connection is doing.

I'm going to have to guess wildly here to focus us on a productive path.
What I want to rule out is a case where the connector hangs while establishing 
a connection.  If this can happen then I could well believe there would be a 
train wreck.  Is this something you can confirm or disprove?
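
One way to confirm or disprove such a hang without debug logging is to time the suspect call and log only when it overruns. The following is a minimal sketch under stated assumptions: `HangProbe` and `completesWithin` are hypothetical names, not part of ManifoldCF, and the timeout value is arbitrary.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of ManifoldCF: reports whether a call returns in time.
final class HangProbe {
  private HangProbe() {}

  // Runs the task on a worker thread and waits at most timeoutMillis for it.
  // Returns true if the task ended in time (normally or with an exception),
  // false if it was still running at the deadline or the wait was interrupted.
  static <T> boolean completesWithin(String label, Callable<T> task, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    long start = System.nanoTime();
    Future<T> future = executor.submit(task);
    try {
      future.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      System.err.println("POSSIBLE HANG: " + label + " still running after " + elapsedMs + " ms");
      return false;
    } catch (ExecutionException e) {
      // The task failed, but it did return, so this is not a hang.
      System.err.println(label + " failed: " + e.getCause());
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      return false;
    } finally {
      executor.shutdownNow(); // interrupts the worker if it is still running
    }
  }
}
```

In a connector, the call that opens the connection could be passed in as the task. The sketch writes a single line only when the deadline is missed, so it should stay within a tight log-size budget.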

Karl


On Thu, Dec 9, 2021 at 9:07 AM Julien Massiera < 
julien.massi...@francelabs.com> wrote:

> Actually, I have several jobs, but only one job is running at a time,
> and currently the error always happens on the same one. The problem is
> that I can't access the environment in debug mode, and I can't enable
> debug logging because I am limited in log size, so the only thing I can
> do is add specific logs in specific places in the code to try to
> understand what is happening. Where would you suggest I add log entries
> to maximize our chances of spotting the issue?
>
> Julien
>
> On 09/12/2021 at 13:27, Karl Wright wrote:
> > A large number of connections can happen, but usually that means
> > something is stuck somewhere and there is a "train wreck" of other
> > locks getting backed up.
> >
> > If this is completely repeatable then I think we have an opportunity
> > to figure out why this is happening.  One thing that is clear is
> > that this doesn't happen in other situations or in our integration
> > tests, so we need to ask: what might you be doing differently here?
> >
> > I was operating on the assumption that the session just expires from 
> > lack of use, but in this case it may well be the other way around: 
> > something hangs elsewhere and a lock is held open for a very long 
> > time, long enough to exceed the timeout.  If you have dozens of jobs 
> > running it might be a challenge to do this but if you can winnow it 
> > down to a small number the logs may give us a good picture of what is 
> > happening.
> >
> > Karl
> >
> >
> >
> >
> > On Wed, Dec 8, 2021 at 3:55 PM Julien Massiera < 
> > julien.massi...@francelabs.com> wrote:
> >
> >> Hi,
> >>
> >> After increasing the session lifetime by a factor of 3, the lock
> >> error still happens and the MCF agent hangs, so all my jobs hang too.
> >>
> >> Also, as I said in the other thread today, I notice a very large
> >> number of simultaneous connections from the agent to Zookeeper
> >> (more than 1000) and I cannot tell whether that is normal.
> >>
> >> Can we ignore that particular error and avoid blocking an entire
> >> MCF node?
> >>
> >> Julien
> >>
> >> On 07/12/2021 at 22:15, Julien Massiera wrote:
> >>> OK, that makes sense. But I still don't understand how the "Can't
> >>> release lock we don't hold" exception can happen, knowing for sure
> >>> that neither the Zookeeper process nor the MCF agent process has
> >>> been down and/or restarted. I'm not sure that increasing the session
> >>> lifetime would solve that particular issue, and since I have no
> >>> use case that easily reproduces it, it is very complicated to debug.
> >>>
> >>> Julien
> >>>
> >>> On 07/12/2021 at 19:08, Karl Wright wrote:
> >>>> What this code is doing is interpreting exceptions back from
> >>>> Zookeeper. There are some kinds of exceptions it interprets as
> >>>> "session has expired", so it rebuilds the session.
> >>>>
> >>>> The code is written in such a way that the locks are presumed to
> >>>> persist beyond the session.  In fact, if they do not persist beyond
> >>>> the session, there is a risk that proper locks won't be enforced.
> >>>>
> >>>> If I recall correctly, we have a number of integration tests that
> >>>> exercise Zookeeper integration that are meant to allow sessions
> >>>> to expire and be re-established.  If what you say is true and
> >>>> information is attached solely to a session, Zookeeper cannot
> >>>> possibly work as the cross-process lock mechanism we use it for.
> >>>> And yet it is used in this way not just by us, but by many other
> >>>> projects as well.
> >>>>
> >>>> So I think that the diagnosis that nodes in Zookeeper have
> >>>> session affinity is not absolutely correct. It may be the case
> >>>> that only one session *owns* a node, and if that session expires
> >>>> then the node goes away.  In that case I think the right approach
> >>>> is to modify the zookeeper parameters to increase the session
> >>>> lifetime; I don't see any other way to prevent bad things from
> >>>> happening.  Presumably, if a session is created within a process,
> >>>> and the process dies, the session does too.
> >>>>
> >>>> Karl
> >>>>
> >>>>
> >>>> On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera < 
> >>>> julien.massi...@francelabs.com> wrote:
> >>>>
> >>>>> Karl,
> >>>>>
> >>>>> I tried to understand the Zookeeper lock logic in the code, and 
> >>>>> the only thing I don't understand is the 
> >>>>> 'handleEphemeralNodeKeeperException'
> >>>>> method that is called in the catch(KeeperException e) of every 
> >>>>> obtain/release lock method of the ZookeeperConnection class.
> >>>>>
> >>>>> This method sets the lockNode param to 'null', recreates a
> >>>>> session and recreates nodes, but does not reset the lockNode
> >>>>> param at the end. So, as I understand it, if this happens it may
> >>>>> result in the lock release error that I mentioned, because that
> >>>>> error is triggered when the lockNode param is 'null'.
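
The sequence described above can be sketched as a small state model. This is a simplified illustration under stated assumptions, not the actual ZooKeeperConnection code: the class `LockNodeModel`, its method names, and the session counter are hypothetical, and only the exception message is taken from the stack trace quoted in this thread.

```java
// Simplified, hypothetical model of the described behaviour; not ManifoldCF code.
final class LockNodeModel {
  private String lockNode;   // null means "no lock held"
  private int sessionId = 1; // stands in for the Zookeeper session

  void obtainLock(String path) {
    if (lockNode != null) {
      throw new IllegalStateException("Already holding " + lockNode);
    }
    lockNode = path + "@session" + sessionId;
  }

  // Models the recovery path: the lock node reference is dropped and a new
  // session is started, but lockNode is never restored afterwards.
  void handleSessionExpired() {
    lockNode = null;
    sessionId++;
  }

  void releaseLock() {
    if (lockNode == null) {
      throw new IllegalStateException("Can't release lock we don't hold");
    }
    lockNode = null;
  }

  boolean holdsLock() {
    return lockNode != null;
  }
}
```

In this model, calling obtainLock, then handleSessionExpired, then releaseLock throws the IllegalStateException, even though the caller never released the lock itself.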
> >>>>>
> >>>>> The method is in the class
> >>>>> org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If
> >>>>> you can take a look and tell me what you think about it, that
> >>>>> would be great!
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Julien
> >>>>>
> >>>>> On 07/12/2021 at 14:40, Julien Massiera wrote:
> >>>>>> Yes, I will try the patch then and see whether it works.
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Julien
> >>>>>>
> >>>>>> On 07/12/2021 at 14:28, Karl Wright wrote:
> >>>>>>> Yes, this is plausible.  But I'm not sure what the solution is.
> >>>>>>> If a zookeeper session disappears, according to the
> >>>>>>> documentation everything associated with that session should
> >>>>>>> also disappear.
> >>>>>>>
> >>>>>>> So I guess we could catch this error and just ignore it, 
> >>>>>>> assuming that the session must be gone anyway?
> >>>>>>>
> >>>>>>> Karl
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera < 
> >>>>>>> julien.massi...@francelabs.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> The Zookeeper lock error mentioned in the second-to-last comment
> >>>>>>>> of this issue
> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-1447:
> >>>>>>>>
> >>>>>>>> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
> >>>>>>>> java.lang.IllegalStateException: Can't release lock we don't hold
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> >>>>>>>>   at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> >>>>>>>>   at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
> >>>>>>>>
> >>>>>>>> is still happening in 2021 with the 2.20 version of MCF.
> >>>>>>>>
> >>>>>>>> Karl, you hypothesized that it could be related to Zookeeper
> >>>>>>>> being restarted while the MCF agent is still running, but
> >>>>>>>> after some investigation, my theory is that it is related to
> >>>>>>>> re-established sessions. Locks are associated not with a
> >>>>>>>> process but with a session, and it could happen that when a
> >>>>>>>> session is closed accidentally (interrupted by exceptions,
> >>>>>>>> etc.), it does not correctly release the locks it set.
> >>>>>>>> When a new session is created by Zookeeper for the same
> >>>>>>>> client, the locks cannot be released because they belong to
> >>>>>>>> an old session, and the exception is thrown!
> >>>>>>>>
> >>>>>>>> Does that seem plausible to you? I have no knowledge of
> >>>>>>>> Zookeeper, but if it is plausible, then it is worth
> >>>>>>>> investigating the code to make sure that all locks are
> >>>>>>>> released when a session is closed or interrupted by a problem.
> >>>>>>>>
> >>>>>>>> Julien
> >>>>>>>>
>
