[GitHub] [solr] patsonluk opened a new pull request, #1762: SOLR-16871: Race condition in `CoordinatorHttpSolrCall` synthetic collection/replica init

via GitHub Thu, 06 Jul 2023 15:25:11 -0700


patsonluk opened a new pull request, #1762:
URL: https://github.com/apache/solr/pull/1762

https://issues.apache.org/jira/browse/SOLR-16871

# Description

A unit test case [that issues concurrent select queries to coordinator
nodes](https://github.com/cowpaths/fullstory-solr/blob/e4226eb8fa2afb01d7615f7faea01f71b144cd58/solr/core/src/test/org/apache/solr/search/TestCoordinatorRole.java#L486)
revealed three possible race conditions:

1. If multiple concurrent requests find that the synthetic collection has not
yet been created, they may all attempt to create it. This can trigger a
SolrException with `collection already exists`.

2. Similarly, if multiple concurrent requests find no replica of the
synthetic collection for the current node (the multiple coordinator node
scenario), then `CoordinatorHttpSolrCall#addReplica` can be invoked multiple
times. This should not trigger any exception, but it would create multiple
replicas for the same node in the synthetic collection.

3. The existing logic
[here](https://github.com/cowpaths/fullstory-solr/blob/6c8531f08301a291478502c262499abed0d5075c/solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java#L102)
assumes that if
`syntheticColl.getReplicas(solrCall.cores.getZkController().getNodeName())`
returns a non-empty result, then the subsequent call
[here](https://github.com/cowpaths/fullstory-solr/blob/6c8531f08301a291478502c262499abed0d5075c/solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java#L112)
should return a core. Unfortunately, the first call can return a non-empty
list that contains only a DOWN replica if another request is in the process of
creating that replica. In this case,
`solrCall.getCoreByCollection(syntheticCollectionName, isPreferLeader)` calls
`super.getCoreByCollection`
[here](https://github.com/cowpaths/fullstory-solr/blob/6c8531f08301a291478502c262499abed0d5075c/solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java#L69),
which returns null (since the super implementation only returns active
replicas). As a result, `CoordinatorHttpSolrCall#getCoreByCollection` ends up
calling `CoordinatorHttpSolrCall#getCore`, which calls `getCoreByCollection`
again, producing infinite recursion and a stack overflow.
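
The cycle in item 3 can be pictured with a deliberately simplified model. This is a sketch only, not the actual `CoordinatorHttpSolrCall` code: the class name, the `ReplicaState` enum, the string standing in for a core, and the `MAX_DEPTH` guard are all hypothetical and exist only so the example terminates instead of overflowing the stack.

```java
// Simplified model of the call cycle described in item 3; not the actual Solr code.
class DownReplicaRecursionSketch {
    enum ReplicaState { DOWN, ACTIVE }

    // Hypothetical guard so the sketch fails fast instead of overflowing the stack.
    static final int MAX_DEPTH = 100;

    final ReplicaState localReplicaState;
    int depth = 0;

    DownReplicaRecursionSketch(ReplicaState localReplicaState) {
        this.localReplicaState = localReplicaState;
    }

    // Models the super implementation: only an ACTIVE replica yields a core.
    String superGetCoreByCollection() {
        return localReplicaState == ReplicaState.ACTIVE ? "core" : null;
    }

    // Models the override: falls back to getCore() when no active core is found.
    String getCoreByCollection() {
        String core = superGetCoreByCollection();
        return core != null ? core : getCore();
    }

    // Models getCore(): a replica already exists for this node (even if DOWN),
    // so nothing is created and the lookup is simply retried.
    String getCore() {
        if (++depth > MAX_DEPTH) {
            throw new IllegalStateException("no progress after " + MAX_DEPTH + " nested calls");
        }
        return getCoreByCollection();
    }
}
```

With an `ACTIVE` replica the lookup returns on the first call. With a `DOWN` replica nothing inside the cycle changes the replica state, so only the artificial depth guard stops the recursion.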

# Solution

1. For the collection creation exception, check again whether the collection
exists; if it does, ignore the exception and proceed.
2. For the replica, if a replica for the node is already present in the
`DocCollection`, ensure that it is active using `zkStateReader.waitForState`.
This avoids the infinite loop caused by the presence of a `down` replica.
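
Step 1 amounts to a "create, and on failure re-check" pattern. The following is a minimal sketch of that idea under stated assumptions: the in-memory `collections` set stands in for cluster state, `createCollection` and `ensureCollection` are hypothetical helpers, and `IllegalStateException` stands in for the `SolrException` mentioned above. None of this is Solr's API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of "create, and on failure re-check"; all names are hypothetical.
class IdempotentCreateSketch {
    // Hypothetical stand-in for the cluster's view of existing collections.
    static final Set<String> collections = ConcurrentHashMap.newKeySet();

    // Hypothetical create call that fails if the collection already exists.
    static void createCollection(String name) {
        if (!collections.add(name)) {
            throw new IllegalStateException("collection already exists: " + name);
        }
    }

    // Safe to call from many threads: losing the creation race is not an error.
    static void ensureCollection(String name) {
        if (collections.contains(name)) {
            return;
        }
        try {
            createCollection(name);
        } catch (IllegalStateException e) {
            // Another caller may have created it between the check and the create.
            // Re-check, and rethrow only if the collection is still missing.
            if (!collections.contains(name)) {
                throw e;
            }
        }
    }
}
```

The point of the design is that no lock is needed: a caller that loses the race simply observes the winner's result and carries on.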

Note that this does NOT address the 2nd issue above: concurrent requests
can still create multiple replicas for the same node in the synthetic
collection, though this is probably benign (and unlikely).
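
The wait in step 2 of the solution can be illustrated with a small polling loop. This is only a stand-in for the idea behind `zkStateReader.waitForState`, not its implementation or signature: `awaitActive`, the `ReplicaState` enum, the `AtomicReference` holding the state, and the 5 ms poll interval are all assumptions made for the sketch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of "wait until the replica is active, with a timeout"; names are hypothetical.
class WaitForActiveSketch {
    enum ReplicaState { DOWN, ACTIVE }

    // Returns true once the state is ACTIVE, false on timeout or interruption.
    static boolean awaitActive(AtomicReference<ReplicaState> state, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (state.get() != ReplicaState.ACTIVE) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // Gave up: the replica never became active in time.
            }
            try {
                Thread.sleep(5); // Assumed poll interval for the sketch.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the interrupt for callers.
                return false;
            }
        }
        return true;
    }
}
```

A caller that sees a `DOWN` replica for its node would wait here and look up the core only after the wait succeeds, rather than retrying the lookup immediately.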

Remarks: The first attempt was to provide proper locking to avoid the race
conditions. However, it is quite tricky to get right: it might require
force-refreshing the zkReader and doing multiple extra reads. The extra cost
and complexity probably do not justify the gain.

# Tests

Added `TestCoordinatorRole#testConcurrentAccess` to reproduce the issue.

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms
to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request
title.
- [ ] I have given Solr maintainers
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the [Reference
Guide](https://github.com/apache/solr/tree/main/solr/solr-ref-guide)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
