Re: Zookeeper locks issue

2021-12-07 Thread Julien Massiera
OK, that makes sense. But I still don't understand how the "Can't
release lock we don't hold" exception can happen, knowing for sure that
neither the Zookeeper process nor the MCF agent process has been down
and/or restarted. I am not sure that increasing the session lifetime would
solve that particular issue, and since I have no use case to easily
reproduce it, it is very complicated to debug.


Julien

On 07/12/2021 at 19:08, Karl Wright wrote:

What this code is doing is interpreting exceptions back from Zookeeper.
There are some kinds of exceptions it interprets as "session has expired",
so it rebuilds the session.

The code is written in such a way that the locks are presumed to persist
beyond the session.  In fact, if they do not persist beyond the session,
there is a risk that proper locks won't be enforced.

If I recall correctly, we have a number of integration tests that exercise
Zookeeper integration that are meant to allow sessions to expire and be
re-established.  If what you say is true and information is attached solely
to a session, Zookeeper cannot possibly work as the cross-process lock
mechanism we use it for.  And yet it is used not just by us in this way,
but by many other projects as well.

So I think that the diagnosis that nodes in Zookeeper have session affinity
is not absolutely correct. It may be the case that only one session *owns*
a node, and if that session expires then the node goes away.  In that case
I think the right approach is to modify the zookeeper parameters to
increase the session lifetime; I don't see any other way to prevent bad
things from happening.  Presumably, if a session is created within a
process, and the process dies, the session does too.

Karl


On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:


Karl,

I tried to understand the Zookeeper lock logic in the code, and the only
thing I don't understand is the 'handleEphemeralNodeKeeperException'
method that is called in the catch(KeeperException e) of every
obtain/release lock method of the ZooKeeperConnection class.

This method sets the lockNode param to 'null', recreates a session and
recreates nodes, but does not reset the lockNode param at the end. So, as
I understand it, if this happens it may result in the lock release error
that I mentioned, because this error is triggered when the lockNode param
is 'null'.

The method is in the class
org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If you could
take a look and tell me what you think about it, that would be great!

Thanks,

Julien

On 07/12/2021 at 14:40, Julien Massiera wrote:

Yes, I will then try the patch and see if it is working

Regards,

Julien

On 07/12/2021 at 14:28, Karl Wright wrote:

Yes, this is plausible.  But I'm not sure what the solution is.  If a
zookeeper session disappears, according to the documentation everything
associated with that session should also disappear.

So I guess we could catch this error and just ignore it, assuming that the
session must be gone anyway?

Karl


On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:


Hi,

The Zookeeper lock error mentioned in the second-to-last comment of this
issue https://issues.apache.org/jira/browse/CONNECTORS-1447:

FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
java.lang.IllegalStateException: Can't release lock we don't hold
at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)

is still happening in 2021 with the 2.20 version of MCF.

Karl, you hypothesized that it could be related to Zookeeper being
restarted while the MCF agent is still running, but after some
investigations, my theory is that it is related to re-established
sessions. Locks are 

Re: Zookeeper locks issue

2021-12-07 Thread Karl Wright
What this code is doing is interpreting exceptions back from Zookeeper.
There are some kinds of exceptions it interprets as "session has expired",
so it rebuilds the session.

The code is written in such a way that the locks are presumed to persist
beyond the session.  In fact, if they do not persist beyond the session,
there is a risk that proper locks won't be enforced.

If I recall correctly, we have a number of integration tests that exercise
Zookeeper integration that are meant to allow sessions to expire and be
re-established.  If what you say is true and information is attached solely
to a session, Zookeeper cannot possibly work as the cross-process lock
mechanism we use it for.  And yet it is used not just by us in this way,
but by many other projects as well.

So I think that the diagnosis that nodes in Zookeeper have session affinity
is not absolutely correct. It may be the case that only one session *owns*
a node, and if that session expires then the node goes away.  In that case
I think the right approach is to modify the zookeeper parameters to
increase the session lifetime; I don't see any other way to prevent bad
things from happening.  Presumably, if a session is created within a
process, and the process dies, the session does too.
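
As a rough illustration of what increasing the session lifetime involves: a ZooKeeper server clamps the timeout a client requests to a range that, when minSessionTimeout and maxSessionTimeout are not set explicitly, defaults to 2 to 20 times tickTime. The sketch below only models that clamping. The class and method names are hypothetical, and it does not call the ZooKeeper API.

```java
// Hypothetical helper: models the server-side clamping only, not the ZooKeeper client API.
class SessionTimeoutSketch {
    // Assumes minSessionTimeout and maxSessionTimeout are left at their
    // defaults of 2 * tickTime and 20 * tickTime.
    static int negotiatedTimeoutMs(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // A client asking for 60 s against a server with tickTime = 2000 ms.
        System.out.println(negotiatedTimeoutMs(60_000, 2_000));
    }
}
```

Under these assumptions, a longer client-side timeout only takes effect if the server's upper bound (maxSessionTimeout, or tickTime itself) is raised as well.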

Karl


On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:

> Karl,
>
> I tried to understand the Zookeeper lock logic in the code, and the only
> thing I don't understand is the 'handleEphemeralNodeKeeperException'
> method that is called in the catch(KeeperException e) of every
> obtain/release lock method of the ZooKeeperConnection class.
>
> This method sets the lockNode param to 'null', recreates a session and
> recreates nodes, but does not reset the lockNode param at the end. So, as
> I understand it, if this happens it may result in the lock release error
> that I mentioned, because this error is triggered when the lockNode param
> is 'null'.
>
> The method is in the class
> org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If you could
> take a look and tell me what you think about it, that would be great!
>
> Thanks,
>
> Julien
>
> On 07/12/2021 at 14:40, Julien Massiera wrote:
> > Yes, I will then try the patch and see if it is working
> >
> > Regards,
> >
> > Julien
> >
> > On 07/12/2021 at 14:28, Karl Wright wrote:
> >> Yes, this is plausible.  But I'm not sure what the solution is.  If a
> >> zookeeper session disappears, according to the documentation everything
> >> associated with that session should also disappear.
> >>
> >> So I guess we could catch this error and just ignore it, assuming that the
> >> session must be gone anyway?
> >>
> >> Karl
> >>
> >>
> >> On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
> >> julien.massi...@francelabs.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> The Zookeeper lock error mentioned in the second-to-last comment of this
> >>> issue https://issues.apache.org/jira/browse/CONNECTORS-1447:
> >>>
> >>> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
> >>> java.lang.IllegalStateException: Can't release lock we don't hold
> >>> at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> >>> at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> >>> at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> >>> at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> >>> at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> >>> at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> >>> at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> >>> at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> >>> at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> >>> at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> >>> at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> >>> at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
> >>>
> >>> is still happening in 2021 with the 2.20 version of MCF.
> >>>
> >>> Karl, you hypothesized that it could be related to Zookeeper being
> >>> restarted while the MCF agent is still running, but after some
> >>> investigations, my theory is that it is related to re-established
> >>> sessions. Locks are not associated to a 

Re: Zookeeper locks issue

2021-12-07 Thread Julien Massiera

Karl,

I tried to understand the Zookeeper lock logic in the code, and the only
thing I don't understand is the 'handleEphemeralNodeKeeperException'
method that is called in the catch(KeeperException e) of every
obtain/release lock method of the ZooKeeperConnection class.

This method sets the lockNode param to 'null', recreates a session and
recreates nodes, but does not reset the lockNode param at the end. So, as
I understand it, if this happens it may result in the lock release error
that I mentioned, because this error is triggered when the lockNode param
is 'null'.
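
To make the suspected sequence concrete, here is a minimal sketch of the pattern as I read it. It is a hypothetical simplification, not the actual ZooKeeperConnection source: the class, the handleSessionLoss method and the single lockNode field are stand-ins, and no real ZooKeeper calls are made.

```java
// Hypothetical simplification of the pattern described above; not the MCF source.
class LockNodeSketch {
    private String lockNode;   // path of the lock node we believe we hold

    void obtainLock(String path) {
        lockNode = path;
    }

    // Stand-in for the recovery path: the node reference is dropped and the
    // session is rebuilt, but lockNode is not restored afterwards.
    void handleSessionLoss() {
        lockNode = null;
    }

    void releaseLock() {
        if (lockNode == null)
            throw new IllegalStateException("Can't release lock we don't hold");
        lockNode = null;
    }

    public static void main(String[] args) {
        LockNodeSketch conn = new LockNodeSketch();
        conn.obtainLock("/locks/example");
        conn.handleSessionLoss();          // recovery happens while the lock is "held"
        try {
            conn.releaseLock();            // the caller still thinks it holds the lock
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the real code behaves like this sketch, the exception would not require either process to be restarted; a single recovered KeeperException between obtaining and releasing a lock would be enough.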


The method is in the class
org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If you could
take a look and tell me what you think about it, that would be great!


Thanks,

Julien

On 07/12/2021 at 14:40, Julien Massiera wrote:

Yes, I will then try the patch and see if it is working

Regards,

Julien

On 07/12/2021 at 14:28, Karl Wright wrote:

Yes, this is plausible.  But I'm not sure what the solution is.  If a
zookeeper session disappears, according to the documentation everything
associated with that session should also disappear.

So I guess we could catch this error and just ignore it, assuming that the
session must be gone anyway?

Karl


On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:


Hi,

The Zookeeper lock error mentioned in the second-to-last comment of this
issue https://issues.apache.org/jira/browse/CONNECTORS-1447:

FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
java.lang.IllegalStateException: Can't release lock we don't hold
at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)

is still happening in 2021 with the 2.20 version of MCF.

Karl, you hypothesized that it could be related to Zookeeper being
restarted while the MCF agent is still running, but after some
investigation, my theory is that it is related to re-established
sessions. Locks are not associated with a process but with a session, and it
could happen that when a session is closed accidentally (interrupted by
exceptions etc.), it does not correctly release the locks it set. When a
new session is created by Zookeeper for the same client, the locks
cannot be released because they belong to an old session, and the
exception is thrown!

Does that seem plausible to you? I have no knowledge of Zookeeper, but
if it is plausible, then it is worth investigating the
code to see whether everything is done correctly to ensure that all locks
are released when a session is closed/interrupted by a problem.

Julien

--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus








Re: Zookeeper locks issue

2021-12-07 Thread Julien Massiera

Yes, I will then try the patch and see if it is working

Regards,

Julien

On 07/12/2021 at 14:28, Karl Wright wrote:

Yes, this is plausible.  But I'm not sure what the solution is.  If a
zookeeper session disappears, according to the documentation everything
associated with that session should also disappear.

So I guess we could catch this error and just ignore it, assuming that the
session must be gone anyway?

Karl


On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:


Hi,

The Zookeeper lock error mentioned in the second-to-last comment of this
issue https://issues.apache.org/jira/browse/CONNECTORS-1447:

FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
java.lang.IllegalStateException: Can't release lock we don't hold
at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)

is still happening in 2021 with the 2.20 version of MCF.

Karl, you hypothesized that it could be related to Zookeeper being
restarted while the MCF agent is still running, but after some
investigation, my theory is that it is related to re-established
sessions. Locks are not associated with a process but with a session, and it
could happen that when a session is closed accidentally (interrupted by
exceptions etc.), it does not correctly release the locks it set. When a
new session is created by Zookeeper for the same client, the locks
cannot be released because they belong to an old session, and the
exception is thrown!

Does that seem plausible to you? I have no knowledge of Zookeeper, but
if it is plausible, then it is worth investigating the
code to see whether everything is done correctly to ensure that all locks
are released when a session is closed/interrupted by a problem.

Julien




Re: Zookeeper locks issue

2021-12-07 Thread Karl Wright
Yes, this is plausible.  But I'm not sure what the solution is.  If a
zookeeper session disappears, according to the documentation everything
associated with that session should also disappear.

So I guess we could catch this error and just ignore it, assuming that the
session must be gone anyway?
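
A minimal sketch of what "catch and ignore" could look like, under the assumption that the IllegalStateException only ever signals a lock that vanished with its session. The Releasable interface and the method name are hypothetical stand-ins, not MCF types.

```java
// Hypothetical wrapper; Releasable stands in for whatever object exposes releaseLock().
class QuietRelease {
    interface Releasable {
        void releaseLock();
    }

    // Returns true if the lock was released, false if it was already gone.
    static boolean releaseIgnoringLostLock(Releasable lock) {
        try {
            lock.releaseLock();
            return true;
        } catch (IllegalStateException e) {
            // Assumption: the owning session expired, so there is nothing left to release.
            return false;
        }
    }

    public static void main(String[] args) {
        Releasable lost = () -> {
            throw new IllegalStateException("Can't release lock we don't hold");
        };
        System.out.println(releaseIgnoringLostLock(lost));
        System.out.println(releaseIgnoringLostLock(() -> { }));
    }
}
```

The trade-off is that a genuine double release elsewhere in the code would be hidden as well, so logging the ignored case would probably be worthwhile.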

Karl


On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:

> Hi,
>
> The Zookeeper lock error mentioned in the second-to-last comment of this
> issue https://issues.apache.org/jira/browse/CONNECTORS-1447:
>
> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
> java.lang.IllegalStateException: Can't release lock we don't hold
> at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
>
> is still happening in 2021 with the 2.20 version of MCF.
>
> Karl, you hypothesized that it could be related to Zookeeper being
> restarted while the MCF agent is still running, but after some
> investigation, my theory is that it is related to re-established
> sessions. Locks are not associated with a process but with a session, and it
> could happen that when a session is closed accidentally (interrupted by
> exceptions etc.), it does not correctly release the locks it set. When a
> new session is created by Zookeeper for the same client, the locks
> cannot be released because they belong to an old session, and the
> exception is thrown!
>
> Does that seem plausible to you? I have no knowledge of Zookeeper, but
> if it is plausible, then it is worth investigating the
> code to see whether everything is done correctly to ensure that all locks
> are released when a session is closed/interrupted by a problem.
>
> Julien


Zookeeper locks issue

2021-12-07 Thread Julien Massiera

Hi,

The Zookeeper lock error mentioned in the second-to-last comment of this
issue https://issues.apache.org/jira/browse/CONNECTORS-1447:

FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
java.lang.IllegalStateException: Can't release lock we don't hold
at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)

is still happening in 2021 with the 2.20 version of MCF.

Karl, you hypothesized that it could be related to Zookeeper being
restarted while the MCF agent is still running, but after some
investigation, my theory is that it is related to re-established
sessions. Locks are not associated with a process but with a session, and it
could happen that when a session is closed accidentally (interrupted by
exceptions etc.), it does not correctly release the locks it set. When a
new session is created by Zookeeper for the same client, the locks
cannot be released because they belong to an old session, and the
exception is thrown!
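
To illustrate the session ownership I have in mind, here is a toy in-memory model. It assumes the lock nodes are ZooKeeper ephemeral nodes, which are deleted when the session that created them ends. The class, its methods and the node path are hypothetical, and nothing here calls the ZooKeeper API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory model of ephemeral-node ownership; not the ZooKeeper API.
class EphemeralModel {
    // node path -> id of the session that created it
    private final Map<String, Long> owners = new HashMap<>();

    void createEphemeral(long sessionId, String path) {
        owners.put(path, sessionId);
    }

    // Ending a session removes every node that session created.
    void expireSession(long sessionId) {
        owners.values().removeIf(owner -> owner == sessionId);
    }

    boolean exists(String path) {
        return owners.containsKey(path);
    }

    public static void main(String[] args) {
        EphemeralModel zk = new EphemeralModel();
        zk.createEphemeral(1L, "/locks/write-0001");   // lock taken in session 1
        zk.expireSession(1L);                          // session 1 is lost
        // A replacement session for the same client has no node left to release.
        System.out.println(zk.exists("/locks/write-0001"));
    }
}
```

In this model the client-side bookkeeping and the server-side nodes can disagree after a session is replaced, which is the situation I suspect behind the exception.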

Does that seem plausible to you? I have no knowledge of Zookeeper, but
if it is plausible, then it is worth investigating the
code to see whether everything is done correctly to ensure that all locks
are released when a session is closed/interrupted by a problem.

Julien
