[ https://issues.apache.org/jira/browse/SOLR-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746392#comment-17746392 ]
ASF subversion and git services commented on SOLR-16871:
--------------------------------------------------------

Commit ccc7ca65f12ee604c2194105b1b7c44822ad15ae in solr's branch refs/heads/main from patsonluk
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=ccc7ca65f12 ]

SOLR-16871: Synchronize on a larger block to avoid race condition in CoordinatorHttpSolrCall init (#1800)

* Synchronize to avoid race condition in CoordinatorHttpSolrCall
* ./gradlew tidy

> Race condition for coordinator node init
> ----------------------------------------
>
>                 Key: SOLR-16871
>                 URL: https://issues.apache.org/jira/browse/SOLR-16871
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Patson Luk
>            Priority: Major
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> A unit test case [that issues concurrent select queries to coordinator
> nodes|https://github.com/cowpaths/fullstory-solr/blob/e4226eb8fa2afb01d7615f7faea01f71b144cd58/solr/core/src/test/org/apache/solr/search/TestCoordinatorRole.java#L486]
> revealed 3 possible race conditions:
> 1. If multiple concurrent requests find that the synthetic collection is not
> yet created, they might all attempt to create it. This could trigger a
> SolrException with `collection already exists`.
> 2. Similarly, if multiple concurrent requests find there’s no replica of the
> synthetic collection for the current node (multiple coordinator node
> scenario), then CoordinatorHttpSolrCall#addReplica could be invoked multiple
> times. This should not trigger any exception, but it would create multiple
> replicas for the same node in the synthetic collection.
> 3. The existing logic
> [here|https://github.com/cowpaths/fullstory-solr/blob/6c8531f08301a291478502c262499abed0d5075c/solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java#L102]
> assumes that if
> syntheticColl.getReplicas(solrCall.cores.getZkController().getNodeName())
> returns a non-empty result, then the following call
> [here|https://github.com/cowpaths/fullstory-solr/blob/6c8531f08301a291478502c262499abed0d5075c/solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java#L112]
> should return a core. Unfortunately, the first call can return a non-empty
> list that contains a DOWN replica if another request is in the process of
> creating that replica. In this case,
> solrCall.getCoreByCollection(syntheticCollectionName, isPreferLeader) would
> call super.getCoreByCollection
> [here|https://github.com/cowpaths/fullstory-solr/blob/6c8531f08301a291478502c262499abed0d5075c/solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java#L69],
> which would return null (since the super impl only returns active replicas).
> So CoordinatorHttpSolrCall#getCoreByCollection would end up calling
> CoordinatorHttpSolrCall#getCore again, introducing infinite recursion and
> causing a stack overflow.
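
For readers unfamiliar with this class of bug, the first race above is a check-then-act problem, and the commit title ("Synchronize on a larger block") suggests the check and the create were put under one lock. The following is only a minimal sketch of that idea, not the code from the commit: the class SyntheticCollectionInitSketch, the in-memory Set standing in for cluster state, the collection name "synthetic_coll" and all method names are hypothetical and not part of Solr.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; none of these names exist in Solr. */
public class SyntheticCollectionInitSketch {
  // Stand-in for cluster state (assumption: a plain in-memory set).
  private final Set<String> collections = new HashSet<>();
  private final AtomicInteger createCalls = new AtomicInteger();
  private final Object initLock = new Object();

  // Stand-in for a create request; fails like "collection already exists".
  private void createCollection(String name) {
    createCalls.incrementAndGet();
    if (!collections.add(name)) {
      throw new IllegalStateException("collection already exists: " + name);
    }
  }

  // The existence check and the create share one lock, so a second request
  // cannot pass the check while the first is still creating.
  public void ensureCollection(String name) {
    synchronized (initLock) {
      if (!collections.contains(name)) {
        createCollection(name);
      }
    }
  }

  public int createCalls() {
    return createCalls.get();
  }

  // Fires `threads` concurrent init requests and returns how many creates ran.
  public static int runConcurrently(int threads) {
    SyntheticCollectionInitSketch sketch = new SyntheticCollectionInitSketch();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await(); // release all workers at once to maximize contention
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        sketch.ensureCollection("synthetic_coll");
      });
    }
    start.countDown();
    pool.shutdown();
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    return sketch.createCalls();
  }

  public static void main(String[] args) {
    System.out.println("create calls: " + runConcurrently(16));
  }
}
```

With the lock covering both steps, only one of the concurrent callers performs the create. If the synchronized block wrapped only createCollection and not the contains check, several callers could pass the check first and then fail one after another, which is the shape of race 1 in the description.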