[
https://issues.apache.org/jira/browse/CURATOR-644?focusedWorklogId=804381&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-804381
]
ASF GitHub Bot logged work on CURATOR-644:
------------------------------------------
Author: ASF GitHub Bot
Created on: 29/Aug/22 13:44
Start Date: 29/Aug/22 13:44
Worklog Time Spent: 10m
Work Description: XComp commented on code in PR #430:
URL: https://github.com/apache/curator/pull/430#discussion_r957341820
##########
curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java:
##########
@@ -667,9 +670,9 @@ protected void handleStateChange(ConnectionState newState)
{
try
{
- if (
client.getConnectionStateErrorPolicy().isErrorState(ConnectionState.SUSPENDED)
|| !hasLeadership.get() )
Review Comment:
I'm wondering whether we should remove this `if` entirely: Let's assume the
current instance doesn't have the leadership but watches the child node
referring to the current leader. If the connection was temporarily suspended it
could be that the actual leader lost its leadership in the meantime. In that
case, we want the current instance to check the leadership. :thinking:
##########
curator-recipes/src/test/java/org/apache/curator/framework/recipes/leader/TestLeaderLatch.java:
##########
@@ -218,6 +218,30 @@ public void testWatchedNodeDeletedOnReconnect() throws
Exception
}
}
+ @Test
+ public void testOurPathDeletedOnReconnect() throws Exception
Review Comment:
The test is not suitable for the change. I reverted the change and ran the
test. The test still succeeded.
##########
curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java:
##########
@@ -667,9 +670,9 @@ protected void handleStateChange(ConnectionState newState)
{
try
{
- if (
client.getConnectionStateErrorPolicy().isErrorState(ConnectionState.SUSPENDED)
|| !hasLeadership.get() )
Review Comment:
But thinking about it once more, this might not be an issue because the
current LeaderLatch would have a watcher on the current leader's node which
would have been triggered when the leader's child would have been deleted.
:thinking:
##########
curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java:
##########
@@ -604,7 +604,7 @@ else if ( ourIndex == 0 )
@Override
public void process(WatchedEvent event)
{
- if ( (state.get() == State.STARTED) && (event.getType() ==
Event.EventType.NodeDeleted) && (localOurPath != null) )
Review Comment:
Good point. Due to the condition in
[LeaderLatch:588](https://github.com/apache/curator/blob/425598cb6bf6a5b227a5fdd293fe46c7978d6578/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L588),
`ourIndex` must be `-1` if `localOurPath` is actually `null` and, therefore,
would prevent the execution from reaching the `else` branch (instead the if
condition in
[LeaderLatch:592](https://github.com/apache/curator/blob/425598cb6bf6a5b227a5fdd293fe46c7978d6578/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L592)
applies) where the watcher is set.
Issue Time Tracking
-------------------
Worklog Id: (was: 804381)
Time Spent: 1.5h (was: 1h 20m)
> CLONE - Race conditions in LeaderLatch after reconnecting to ensemble
> ---------------------------------------------------------------------
>
> Key: CURATOR-644
> URL: https://issues.apache.org/jira/browse/CURATOR-644
> Project: Apache Curator
> Issue Type: Bug
> Affects Versions: 4.2.0
> Reporter: Ken Huang
> Assignee: Jordan Zimmerman
> Priority: Minor
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Clone from CURATOR-504.
> We use LeaderLatch in a lot of places in our system and when ZooKeeper
> ensemble is unstable and clients are reconnecting to logs are full of
> messages like the following:
> {{{}[2017-08-31
> 19:18:34,562][ERROR][org.apache.curator.framework.recipes.leader.LeaderLatch]
> Can't find our node. Resetting. Index: -1 {{}}}}
> According to the
> [implementation|https://github.com/apache/curator/blob/4251fe328908e5fca37af034fabc190aa452c73f/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L529-L536],
> this can happen in two cases:
> * When internal state `ourPath` is null
> * When the list of latches does not have the expected one.
> I believe we hit the first condition because of races that occur after client
> reconnects to ZooKeeper.
> * Client reconnects to ZooKeeper and LeaderLatch gets the event and calls
> reset method which set the internal state (`ourPath`) to null, removes old
> latch and creates a new one. This happens in thread
> "Curator-ConnectionStateManager-0".
> * Almost simultaneously, LeaderLatch gets another even NodeDeleted
> ([here|https://github.com/apache/curator/blob/4251fe328908e5fca37af034fabc190aa452c73f/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L543-L554])
> and tries to re-read the list of latches and check leadership. This happens
> in the thread "main-EventThread".
> Therefore, sometimes there is a situation when method `checkLeadership` is
> called when `ourPath` is null.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)