Hi,

I'm not sure I understand what you mean by establishing a connection: what kind of connection? If you can explain when the agent needs to set and release locks, I'll be able to investigate better.

Julien

-----Original Message-----
From: Karl Wright <daddy...@gmail.com>
Sent: Thursday, December 9, 2021 15:37
To: dev <dev@manifoldcf.apache.org>
Subject: Re: Zookeeper locks issue

The fact that you only see this on one job is pretty clear evidence that we are seeing a hang of some kind due to something a specific connector or connection is doing.

I'm going to have to guess wildly here to focus us on a productive path. What I want to rule out is a case where the connector hangs while establishing a connection. If this can happen then I could well believe there would be a train wreck. Is this something you can confirm or disprove?

Karl

On Thu, Dec 9, 2021 at 9:07 AM Julien Massiera <julien.massi...@francelabs.com> wrote:

> Actually, I have several jobs, but only one job is running at a time,
> and currently the error always happens on the same one. The problem is
> that I can't access the environment in debug mode, and I can't
> activate debug logging either because I am limited in log size, so the
> only thing I can do is add specific logs in specific places in the
> code to try to understand what is happening. Where would you suggest I
> add log entries to maximize our chances of spotting the issue?
>
> Julien
>
> On Dec 9, 2021 at 13:27, Karl Wright wrote:
> > The large number of connections can happen, but usually that means
> > something is stuck somewhere and there is a "train wreck" of other
> > locks getting backed up.
> >
> > If this is completely repeatable then I think we have an opportunity
> > to figure out why this is happening. One thing that is clear is
> > that this doesn't happen in other situations or in our integration
> > tests, so that makes it necessary to ask what you may be doing
> > differently here?
> >
> > I was operating on the assumption that the session just expires from
> > lack of use, but in this case it may well be the other way around:
> > something hangs elsewhere and a lock is held open for a very long
> > time, long enough to exceed the timeout. If you have dozens of jobs
> > running it might be a challenge to do this, but if you can winnow it
> > down to a small number the logs may give us a good picture of what is
> > happening.
> >
> > Karl
> >
> > On Wed, Dec 8, 2021 at 3:55 PM Julien Massiera <julien.massi...@francelabs.com> wrote:
> >
> >> Hi,
> >>
> >> after having increased the session lifetime by 3, the lock error
> >> still happens and the MCF agent hangs, so all my jobs also hang.
> >>
> >> Also, as I said in the other thread today, I notice a very large
> >> number of simultaneous connections from the agent to Zookeeper
> >> (more than 1000) and I cannot tell whether that is normal.
> >>
> >> Can we ignore that particular error and avoid blocking an entire
> >> MCF node?
> >>
> >> Julien
> >>
> >> On Dec 7, 2021 at 22:15, Julien Massiera wrote:
> >>> OK, that makes sense. But still, I don't understand how the "Can't
> >>> release lock we don't hold" exception can happen, knowing for sure
> >>> that neither the Zookeeper process nor the MCF agent process has
> >>> been down and/or restarted. I am not sure that increasing the
> >>> session lifetime would solve that particular issue, and since I
> >>> have no use case to easily reproduce it, it is very complicated to
> >>> debug.
> >>>
> >>> Julien
> >>>
> >>> On Dec 7, 2021 at 19:08, Karl Wright wrote:
> >>>> What this code is doing is interpreting exceptions back from
> >>>> Zookeeper. There are some kinds of exceptions it interprets as
> >>>> "session has expired", so it rebuilds the session.
> >>>>
> >>>> The code is written in such a way that the locks are presumed to
> >>>> persist beyond the session.
> >>>> In fact, if they do not persist beyond the session, there is a
> >>>> risk that proper locks won't be enforced.
> >>>>
> >>>> If I recall correctly, we have a number of integration tests that
> >>>> exercise Zookeeper integration that are meant to allow sessions
> >>>> to expire and be re-established. If what you say is true and
> >>>> information is attached solely to a session, Zookeeper cannot
> >>>> possibly work as the cross-process lock mechanism we use it for.
> >>>> And yet it is used not just by us in this way, but by many other
> >>>> projects as well.
> >>>>
> >>>> So I think that the diagnosis that nodes in Zookeeper have
> >>>> session affinity is not absolutely correct. It may be the case
> >>>> that only one session *owns* a node, and if that session expires
> >>>> then the node goes away. In that case I think the right approach
> >>>> is to modify the zookeeper parameters to increase the session
> >>>> lifetime; I don't see any other way to prevent bad things from
> >>>> happening. Presumably, if a session is created within a process,
> >>>> and the process dies, the session does too.
> >>>>
> >>>> Karl
> >>>>
> >>>> On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera <julien.massi...@francelabs.com> wrote:
> >>>>
> >>>>> Karl,
> >>>>>
> >>>>> I tried to understand the Zookeeper lock logic in the code, and
> >>>>> the only thing I don't understand is the
> >>>>> 'handleEphemeralNodeKeeperException' method that is called in
> >>>>> the catch(KeeperException e) of every obtain/release lock method
> >>>>> of the ZooKeeperConnection class.
> >>>>>
> >>>>> This method sets the lockNode param to 'null', recreates a
> >>>>> session and recreates nodes, but does not reset the lockNode
> >>>>> param at the end. So, as I understand it, if that happens it may
> >>>>> result in the lock release error that I mentioned, because this
> >>>>> error is triggered when the lockNode param is 'null'.
> >>>>>
> >>>>> The method is in the class
> >>>>> org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If
> >>>>> you can take a look and tell me what you think about it, it
> >>>>> would be great!
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Julien
> >>>>>
> >>>>> On Dec 7, 2021 at 14:40, Julien Massiera wrote:
> >>>>>> Yes, I will then try the patch and see if it works.
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Julien
> >>>>>>
> >>>>>> On Dec 7, 2021 at 14:28, Karl Wright wrote:
> >>>>>>> Yes, this is plausible. But I'm not sure what the solution is.
> >>>>>>> If a zookeeper session disappears, according to the
> >>>>>>> documentation everything associated with that session should
> >>>>>>> also disappear.
> >>>>>>>
> >>>>>>> So I guess we could catch this error and just ignore it,
> >>>>>>> assuming that the session must be gone anyway?
> >>>>>>>
> >>>>>>> Karl
> >>>>>>>
> >>>>>>> On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <julien.massi...@francelabs.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> the Zookeeper lock error mentioned in the second-to-last
> >>>>>>>> comment of this issue
> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-1447:
> >>>>>>>>
> >>>>>>>> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
> >>>>>>>> java.lang.IllegalStateException: Can't release lock we don't hold
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> >>>>>>>>   at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> >>>>>>>>   at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
> >>>>>>>>
> >>>>>>>> is still happening in 2021 with the 2.20 version of MCF.
> >>>>>>>>
> >>>>>>>> Karl, you hypothesized that it could be related to Zookeeper
> >>>>>>>> being restarted while the MCF agent is still running, but
> >>>>>>>> after some investigation, my theory is that it is related to
> >>>>>>>> re-established sessions. Locks are not associated with a
> >>>>>>>> process but with a session, and it could happen that when a
> >>>>>>>> session is closed accidentally (interrupted by exceptions,
> >>>>>>>> etc.), it does not correctly release the locks it set. When a
> >>>>>>>> new session is created by Zookeeper for the same client, the
> >>>>>>>> locks cannot be released because they belong to an old
> >>>>>>>> session, and the exception is thrown!
> >>>>>>>>
> >>>>>>>> Does that seem plausible to you? I have no knowledge of
> >>>>>>>> Zookeeper, but if it is plausible, then it is worth
> >>>>>>>> investigating the code to see if everything is correctly done
> >>>>>>>> to be sure that all locks are released when a session is
> >>>>>>>> closed/interrupted by a problem.
> >>>>>>>>
> >>>>>>>> Julien
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> This email has been checked for viruses by Avast antivirus software.
> >>>>>>>> https://www.avast.com/antivirus
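
For readers following the thread, the failure mode Julien describes for 'handleEphemeralNodeKeeperException' can be sketched as a toy model. This is not the ManifoldCF code: the class LockSessionSketch, the FakeSession stand-in, the method recoverExpiredSession and the "/locks/job" path are all hypothetical names chosen for illustration, and no real ZooKeeper client is involved. Only the lockNode field name and the exception message are taken from the thread. The sketch assumes, as described above, that session recovery clears lockNode without restoring it, so a later release fails.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the scenario discussed in the thread; NOT the real ZooKeeperConnection. */
final class LockSessionSketch {

    /** Hypothetical in-memory stand-in for a session that owns ephemeral nodes. */
    static final class FakeSession {
        private final Set<String> ephemeralNodes = new HashSet<>();
        private int sequence = 0;

        String createEphemeral(String prefix) {
            String path = prefix + "-" + (sequence++);
            ephemeralNodes.add(path);
            return path;
        }

        void delete(String path) {
            ephemeralNodes.remove(path);
        }
    }

    private FakeSession session = new FakeSession();
    private String lockNode; // null means "we hold no lock"

    void obtainLock(String lockPath) {
        if (lockNode != null) {
            // Message invented for this sketch.
            throw new IllegalStateException("Already holding a lock");
        }
        lockNode = session.createEphemeral(lockPath + "/write");
    }

    /**
     * Models the recovery path as described in the thread: the node reference
     * is cleared and a new session is built, but lockNode is never restored.
     * The old session's ephemeral nodes are gone along with the old session.
     */
    void recoverExpiredSession() {
        lockNode = null;
        session = new FakeSession();
    }

    void releaseLock() {
        if (lockNode == null) {
            // Same message as in the stack trace quoted in the thread.
            throw new IllegalStateException("Can't release lock we don't hold");
        }
        session.delete(lockNode);
        lockNode = null;
    }

    public static void main(String[] args) {
        LockSessionSketch conn = new LockSessionSketch();
        conn.obtainLock("/locks/job");
        conn.recoverExpiredSession(); // session rebuilt while the caller still thinks it holds the lock
        try {
            conn.releaseLock();
        } catch (IllegalStateException e) {
            System.out.println("release failed: " + e.getMessage());
        }
    }
}
```

Running main prints "release failed: Can't release lock we don't hold". In this model, Karl's suggestion of catching the error and ignoring it would amount to treating a null lockNode as "already released"; whether that is safe in the real code is exactly the open question in the thread.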