Samuel García Martínez created SOLR-10181:
---------------------------------------------

             Summary: CREATEALIAS and DELETEALIAS commands consistency problems 
under concurrency
                 Key: SOLR-10181
                 URL: https://issues.apache.org/jira/browse/SOLR-10181
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 6.4.1, 5.5, 5.4, 5.3
            Reporter: Samuel García Martínez


When several CREATEALIAS commands are run at the same time by the OCP, some of 
their changes can be lost even though every API response is OK.

The problem happens because the CREATEALIAS cmd implementation relies on 
zkStateReader.getAliases() to build the map that is then stored in ZK. If 
several threads reach that line at the same time, each one builds its map from 
the same snapshot, so only one write is stored correctly and the others are 
overwritten.

The code I'm referencing is [this 
piece|https://github.com/apache/lucene-solr/blob/8c1e67e30e071ceed636083532d4598bf6a8791f/solr/core/src/java/org/apache/solr/cloud/CreateAliasCmd.java#L65].
 As an example, let's say that the current aliases map is {a:colA, b:colB}. If 
two CREATEALIAS commands (one adding c:colC and the other adding d:colD) are 
scheduled in the _tpe_ and reach that line at the same time, the resulting maps 
will be {a:colA, b:colB, c:colC} and {a:colA, b:colB, d:colD}. Only one of them 
ends up stored in ZK, resulting in "data loss": the API returns OK for both 
requests even though one of them did not take effect.
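
A minimal sketch of the interleaving described above, written as plain Java 
with no Solr or ZooKeeper dependency. The class LostAliasUpdate, the field 
stored and the helper withAlias are hypothetical stand-ins for illustration 
only; they are not the CreateAliasCmd code. The two "commands" are interleaved 
by hand so the outcome is deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class LostAliasUpdate {

    // Hypothetical stand-in for the content of aliases.json in ZK.
    static Map<String, String> stored = new HashMap<>();

    // Builds the new map from a previously read snapshot (read-modify-write).
    static Map<String, String> withAlias(Map<String, String> snapshot,
                                         String alias, String collection) {
        Map<String, String> copy = new HashMap<>(snapshot);
        copy.put(alias, collection);
        return copy;
    }

    // Replays the problematic interleaving and returns what ends up stored.
    static Map<String, String> simulate() {
        stored = new HashMap<>();
        stored.put("a", "colA");
        stored.put("b", "colB");

        // Both commands read the same snapshot before either one writes.
        Map<String, String> seenByFirst = new HashMap<>(stored);
        Map<String, String> seenBySecond = new HashMap<>(stored);

        Map<String, String> first = withAlias(seenByFirst, "c", "colC");
        Map<String, String> second = withAlias(seenBySecond, "d", "colD");

        // Unconditional writes: the second one silently replaces the first.
        stored = first;
        stored = second;
        return new TreeMap<>(stored);
    }

    public static void main(String[] args) {
        // prints {a=colA, b=colB, d=colD}; the alias c was lost
        System.out.println(simulate());
    }
}
```

Both writes "succeed" from the caller's point of view, which matches the OK 
responses seen from the API.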

On top of this, a second concurrency problem can occur when the command 
verifies the alias it has just set using the _checkForAlias_ method. After two 
CREATEALIAS ZK writes run at the same time, only one of them has "survived" and 
appears in the _zkStateReader.getAliases()_ map, so the thread whose write was 
lost can time out while checking for its alias.

I can post a patch for this if someone gives me directions on how it should be 
fixed. As I see it, there are two places where the issue can be fixed: in the 
processor (OverseerCollectionMessageHandler) in a generic way, or inside the 
command itself.

The processor fix
The locking mechanism (OverseerCollectionMessageHandler#lockTask) should be the 
place to fix this inside the processor. I thought that adding the operation 
name to the locking key, instead of using only "collection" or "name", would 
fix the issue. However, I realized that the problem would still happen when 
different operations modify the same resource concurrently (as CREATEALIAS and 
DELETEALIAS do). So if this is the path to follow, I don't know what should be 
used as the locking key.

The command fix
Fixing it at the command level (CreateAliasCmd and DeleteAliasCmd) would be 
relatively easy using optimistic locking, i.e., passing the aliases.json ZK 
version to keeper.setData. To do that, the Aliases class should expose the 
aliases version so the commands can forward that version with the update and 
retry when the write fails.
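
A rough sketch of that optimistic-locking loop, assuming an in-memory stand-in 
for the aliases.json znode so it runs without ZooKeeper. AliasNode, Versioned, 
BadVersionException and update are hypothetical names invented for this 
example; they only mirror the behaviour of ZooKeeper's 
setData(path, data, version), which rejects a write whose expected version does 
not match the current one. It is not the proposed patch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

final class OptimisticAliasUpdate {

    // Immutable snapshot: the aliases plus the version they were read at.
    static final class Versioned {
        final Map<String, String> aliases;
        final int version;

        Versioned(Map<String, String> aliases, int version) {
            this.aliases = Collections.unmodifiableMap(new HashMap<>(aliases));
            this.version = version;
        }
    }

    // Hypothetical analogue of ZooKeeper's bad-version error.
    static final class BadVersionException extends Exception {
        private static final long serialVersionUID = 1L;
    }

    // In-memory stand-in for the aliases.json znode (assumption, not Solr code).
    static final class AliasNode {
        private Versioned current = new Versioned(new HashMap<>(), 0);

        synchronized Versioned read() {
            return current;
        }

        // Conditional write: succeeds only if nobody wrote since expectedVersion.
        synchronized void setData(Map<String, String> aliases, int expectedVersion)
                throws BadVersionException {
            if (expectedVersion != current.version) {
                throw new BadVersionException();
            }
            current = new Versioned(aliases, current.version + 1);
        }
    }

    // Read, apply the change to a copy, write with the version that was read,
    // and start over on conflict. A real command would likely bound the retries.
    static void update(AliasNode node, UnaryOperator<Map<String, String>> change) {
        while (true) {
            Versioned seen = node.read();
            Map<String, String> next = change.apply(new HashMap<>(seen.aliases));
            try {
                node.setData(next, seen.version);
                return;
            } catch (BadVersionException e) {
                // Another writer got in first; re-read and re-apply the change.
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AliasNode node = new AliasNode();
        update(node, m -> { m.put("a", "colA"); m.put("b", "colB"); return m; });

        // One CREATEALIAS-like and one DELETEALIAS-like change, run concurrently.
        Thread create = new Thread(() -> update(node, m -> { m.put("c", "colC"); return m; }));
        Thread delete = new Thread(() -> update(node, m -> { m.remove("b"); return m; }));
        create.start();
        delete.start();
        create.join();
        delete.join();

        // prints {a=colA, c=colC} whichever thread writes first
        System.out.println(new TreeMap<>(node.read().aliases));
    }
}
```

Because the change is re-applied to a fresh snapshot after each rejected 
write, this approach also covers the CREATEALIAS versus DELETEALIAS case 
without needing a shared locking key.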



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
