Patson Luk created SOLR-16412:
---------------------------------

             Summary: Race condition could trigger error on concurrent SizeLimitedDistributedMap cleanup
                 Key: SOLR-16412
                 URL: https://issues.apache.org/jira/browse/SOLR-16412
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 8.8, main (10.0)
            Reporter: Patson Luk


h2. Description

The exception below is observed while updating the `completedMap` field in 
`OverseerTaskProcessor`:

{{o.a.s.c.OverseerTaskProcessor :org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /overseer/collection-map-completed/mn-736f6c726d616e2d312d31383930383730393837313333303932353331}}
{{at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)}}
{{at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)}}
{{at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)}}
{{at org.apache.solr.common.cloud.SolrZkClient.lambda$delete$1(SolrZkClient.java:264)}}
{{at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)}}
{{at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:263)}}
{{at org.apache.solr.cloud.SizeLimitedDistributedMap.put(SizeLimitedDistributedMap.java:76)}}
{{at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:538)}}
{{at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)}}
{{at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)}}
{{at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)}}


h2. Cause

Based on the stack trace, `SizeLimitedDistributedMap` had reached its size limit and 
attempted to clean up entries:
[https://github.com/fullstorydev/lucene-solr/blob/75e89929eb360b513ee864aeb23a80c049747246/solr/core/src/java/org/apache/solr/cloud/SizeLimitedDistributedMap.java#L73-L80]

However, the actual deletion failed with `NoNodeException`.

This is likely caused by a race condition: multiple threads can enter the same 
code block and try to delete the same list of children, so a slower thread can 
attempt to delete a child node that no longer exists.
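
The interleaving described above can be sketched without ZooKeeper. The example below is only an illustration under stated assumptions: `PurgeRaceDemo`, its in-memory node store, and its own `NoNodeException` class are hypothetical stand-ins, not the Solr or ZooKeeper classes from the stack trace. A barrier forces both purgers to take their snapshot of the children before either one deletes, which is the situation suspected here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory stand-in for the child nodes; not a Solr or ZooKeeper API.
class PurgeRaceDemo {
    // Hypothetical exception mirroring the role of KeeperException.NoNodeException.
    static class NoNodeException extends Exception {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    private final Set<String> nodes = ConcurrentHashMap.newKeySet();

    void create(String path) { nodes.add(path); }

    List<String> children() { return new ArrayList<>(nodes); }

    // Strict delete: fails if the node is already gone.
    void delete(String path) throws NoNodeException {
        if (!nodes.remove(path)) throw new NoNodeException(path);
    }

    // Two purgers snapshot the same children, then both try to delete them.
    // Returns how many deletes failed because the node was already gone.
    static int runRace(int nodeCount) {
        PurgeRaceDemo store = new PurgeRaceDemo();
        for (int i = 0; i < nodeCount; i++) store.create("/map/mn-" + i);

        CyclicBarrier bothHaveSnapshot = new CyclicBarrier(2);
        AtomicInteger failures = new AtomicInteger();
        Runnable purger = () -> {
            List<String> snapshot = store.children();
            try {
                bothHaveSnapshot.await(); // neither thread deletes until both have a snapshot
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
            for (String path : snapshot) {
                try {
                    store.delete(path);
                } catch (NoNodeException e) {
                    failures.incrementAndGet(); // the other purger deleted it first
                }
            }
        };

        Thread a = new Thread(purger);
        Thread b = new Thread(purger);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return failures.get();
    }

    public static void main(String[] args) {
        // Every node is deleted successfully once and missed once.
        System.out.println(runRace(5)); // prints 5
    }
}
```

The sketch counts the failures so that the outcome is observable; in the reported stack trace the exception instead propagates out of `SizeLimitedDistributedMap.put`.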

Such a condition can be reproduced by a unit test case, which will be included in 
the PR.

h2. Solution

Although we could enforce synchronization to prevent threads from purging the 
same set of child nodes, it might not be desirable to add extra blocking.

Instead, it's probably safe to ignore `KeeperException.NoNodeException` during 
the purge operation: if a node is no longer there, the purge has already achieved 
its goal for that node.
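
A minimal sketch of that idea, assuming a simplified in-memory store, is shown below. `TolerantPurgeDemo`, its `purge` method, and its own `NoNodeException` class are hypothetical names used for illustration; this is not the actual patch to `SizeLimitedDistributedMap`.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the ZooKeeper-backed map; not a Solr API.
class TolerantPurgeDemo {
    // Hypothetical exception mirroring the role of KeeperException.NoNodeException.
    static class NoNodeException extends Exception {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    private final Set<String> nodes = ConcurrentHashMap.newKeySet();

    void create(String path) { nodes.add(path); }

    int size() { return nodes.size(); }

    // Strict delete: throws when the node is already gone.
    void delete(String path) throws NoNodeException {
        if (!nodes.remove(path)) throw new NoNodeException(path);
    }

    // Purge that treats "already deleted" as success.
    // Returns how many nodes this caller actually removed.
    int purge(List<String> snapshot) {
        int removed = 0;
        for (String path : snapshot) {
            try {
                delete(path);
                removed++;
            } catch (NoNodeException alreadyGone) {
                // Another purger removed the node first; nothing left to do for it.
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        TolerantPurgeDemo store = new TolerantPurgeDemo();
        List<String> snapshot = List.of("/map/mn-0", "/map/mn-1", "/map/mn-2");
        snapshot.forEach(store::create);
        System.out.println(store.purge(snapshot)); // prints 3: the first purger removes everything
        System.out.println(store.purge(snapshot)); // prints 0: the second purger finds nothing and does not throw
    }
}
```

Only the "node already gone" case is swallowed here; any other failure would still surface to the caller, and no extra locking is introduced.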


