[ 
https://issues.apache.org/jira/browse/SOLR-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-7673:
----------------------------------------
    Attachment: SOLR-7673.patch

Here's a patch which fixes both race conditions.

This patch creates the replicas in cluster state first (by submitting an 
addreplica command to the overseer queue), then calls updateshardstate to set 
the sub-slices to the 'recovery' state, and only then invokes the addreplica 
action to actually create the replica cores.
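The reordering can be sketched as a tiny simulation. The method names below are hypothetical stand-ins for the real OverseerCollectionProcessor steps, which are considerably more involved:

```java
import java.util.*;

// Sketch of the reordered split-shard steps: (1) record the replica in
// cluster state, (2) flip the sub-slice to 'recovery', (3) only then create
// the actual core. Names are illustrative, not the real OCP API.
public class SplitShardOrder {
    enum SliceState { CONSTRUCTION, RECOVERY, ACTIVE }

    static final List<String> events = new ArrayList<>();
    static SliceState subSliceState = SliceState.CONSTRUCTION;

    static void addReplicaToClusterState(String core) {
        events.add("clusterstate:" + core);
    }

    static void updateShardState(SliceState s) {
        subSliceState = s;
        events.add("slice:" + s);
    }

    static void createCore(String core) {
        // record which slice state was visible when the core came up
        events.add("core:" + core + ":sliceWas:" + subSliceState);
    }

    public static void main(String[] args) {
        String core = "testasynccollectioncreation_shard1_0_replica2";
        addReplicaToClusterState(core);          // step 1
        updateShardState(SliceState.RECOVERY);   // step 2
        createCore(core);                        // step 3
        System.out.println(events);
    }
}
```

Because the sub-slice is already in 'recovery' by the time the core starts, the leader's wait-for-recovery check can no longer miss the transition.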

A new flag called "skipCreateReplicaInClusterState" is added to the addreplica 
action so that the OCP doesn't try to create the replica in cluster state 
again. This is a hack to avoid duplicating most of the addreplica logic inside 
splitshard, and I've put in comments to this effect. I did not want to perform 
potentially dangerous refactoring as part of this bug fix, but once this is 
committed we can open a separate issue just to refactor the OCP.
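The effect of the flag can be sketched as follows. Only the property name "skipCreateReplicaInClusterState" comes from the patch; the surrounding method and message shape are illustrative:

```java
import java.util.*;

// Hypothetical sketch of how the addreplica path honours the new flag:
// when splitshard has already registered the replica in cluster state,
// the handler skips that step and only creates the core.
public class AddReplicaSketch {
    static final List<String> steps = new ArrayList<>();

    static void addReplica(Map<String, String> message) {
        boolean skip = Boolean.parseBoolean(
                message.getOrDefault("skipCreateReplicaInClusterState", "false"));
        if (!skip) {
            // normal path: OCP registers the replica in cluster state itself
            steps.add("createReplicaInClusterState");
        }
        // the core itself is always created
        steps.add("createCore");
    }

    public static void main(String[] args) {
        addReplica(Map.of("skipCreateReplicaInClusterState", "true"));
        System.out.println(steps);   // [createCore]
    }
}
```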

I've been beasting this test quite a bit and haven't been able to reproduce the 
failures.
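The deadlock described in the quoted report below can be modelled as a bounded polling loop. The names and the loop bound are illustrative; the real code waits on cluster state until a much longer timeout:

```java
// Minimal model of the race: the sub-shard leader polls for the replica to
// reach 'recovering', but the replica has already finished recovery and gone
// 'active'. States only move forward, so the awaited transition never comes.
public class SplitWaitRace {
    public static void main(String[] args) {
        String replicaState = "active";    // replica2 became active first
        int attempts = 0, maxAttempts = 5; // stand-in for the real timeout
        boolean sawRecovering = false;
        while (attempts++ < maxAttempts) {
            if ("recovering".equals(replicaState)) {
                sawRecovering = true;
                break;
            }
            // 'active' never becomes 'recovering' again, so we just spin
        }
        System.out.println(sawRecovering ? "proceed" : "timed out waiting");
    }
}
```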

> Race condition in shard splitting waiting for sub-shard replicas to go into 
> recovery
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7673
>                 URL: https://issues.apache.org/jira/browse/SOLR-7673
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10.4, 5.2
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 5.3, Trunk
>
>         Attachments: Lucene-Solr-Tests-trunk-Java8-69.log, SOLR-7673.patch
>
>
> I found this while digging into the failure on 
> https://builds.apache.org/job/Lucene-Solr-Tests-trunk-Java8/69/
> {code}
>    [junit4]   2> 390032 INFO  
> (OverseerStateUpdate-93987739315601411-127.0.0.1:57926_-n_0000000000) 
> [n:127.0.0.1:57926_    ] o.a.s.c.o.ReplicaMutator Update state numShards=1 
> message={
>    [junit4]   2>   "core":"testasynccollectioncreation_shard1_0_replica2",
>    [junit4]   2>   "core_node_name":"core_node5",
>    [junit4]   2>   "roles":null,
>    [junit4]   2>   "base_url":"http://127.0.0.1:38702",
>    [junit4]   2>   "node_name":"127.0.0.1:38702_",
>    [junit4]   2>   "numShards":"1",
>    [junit4]   2>   "state":"active",
>    [junit4]   2>   "shard":"shard1_0",
>    [junit4]   2>   "collection":"testasynccollectioncreation",
>    [junit4]   2>   "operation":"state"}
>    [junit4]   2> 390033 INFO  
> (OverseerStateUpdate-93987739315601411-127.0.0.1:57926_-n_0000000000) 
> [n:127.0.0.1:57926_    ] o.a.s.c.o.ZkStateWriter going to update_collection 
> /collections/testasynccollectioncreation/state.json version: 18
>    [junit4]   2> 390033 INFO  
> (RecoveryThread-testasynccollectioncreation_shard1_0_replica2) 
> [n:127.0.0.1:38702_ c:testasynccollectioncreation s:shard1_0 r:core_node5 
> x:testasynccollectioncreation_shard1_0_replica2] o.a.s.u.UpdateLog Took 2 ms 
> to seed version buckets with highest version 1503803851028824064
>    [junit4]   2> 390033 INFO  
> (zkCallback-167-thread-1-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_   
>  ] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent 
> state:SyncConnected type:NodeDataChanged 
> path:/collections/testasynccollectioncreation/state.json for collection 
> testasynccollectioncreation has occurred - updating... (live nodes size: 2)
>    [junit4]   2> 390033 INFO  
> (RecoveryThread-testasynccollectioncreation_shard1_0_replica2) 
> [n:127.0.0.1:38702_ c:testasynccollectioncreation s:shard1_0 r:core_node5 
> x:testasynccollectioncreation_shard1_0_replica2] o.a.s.c.RecoveryStrategy 
> Finished recovery process.
>    [junit4]   2> 390033 INFO  
> (zkCallback-173-thread-1-processing-n:127.0.0.1:38702_) [n:127.0.0.1:38702_   
>  ] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent 
> state:SyncConnected type:NodeDataChanged 
> path:/collections/testasynccollectioncreation/state.json for collection 
> testasynccollectioncreation has occurred - updating... (live nodes size: 2)
>    [junit4]   2> 390034 INFO  
> (zkCallback-167-thread-1-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_   
>  ] o.a.s.c.c.ZkStateReader Updating data for testasynccollectioncreation to 
> ver 19 
>    [junit4]   2> 390034 INFO  
> (zkCallback-173-thread-1-processing-n:127.0.0.1:38702_) [n:127.0.0.1:38702_   
>  ] o.a.s.c.c.ZkStateReader Updating data for testasynccollectioncreation to 
> ver 19 
>    [junit4]   2> 390298 INFO  (qtp1859109869-1664) [n:127.0.0.1:38702_    ] 
> o.a.s.h.a.CollectionsHandler Invoked Collection Action :requeststatus with 
> paramsrequestid=1004&action=REQUESTSTATUS&wt=javabin&version=2 
>    [junit4]   2> 390299 INFO  (qtp1859109869-1664) [n:127.0.0.1:38702_    ] 
> o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/collections 
> params={requestid=1004&action=REQUESTSTATUS&wt=javabin&version=2} status=0 
> QTime=1 
>    [junit4]   2> 390348 INFO  (qtp1859109869-1665) [n:127.0.0.1:38702_    ] 
> o.a.s.h.a.CoreAdminHandler Checking request status for : 10042599422118342586
>    [junit4]   2> 390348 INFO  (qtp1859109869-1665) [n:127.0.0.1:38702_    ] 
> o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/cores 
> params={qt=/admin/cores&requestid=10042599422118342586&action=REQUESTSTATUS&wt=javabin&version=2}
>  status=0 QTime=1 
>    [junit4]   2> 390348 INFO  
> (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_) 
> [n:127.0.0.1:57926_    ] o.a.s.c.OverseerCollectionProcessor Asking sub shard 
> leader to wait for: testasynccollectioncreation_shard1_0_replica2 to be alive 
> on: 127.0.0.1:38702_
>    [junit4]   2> 390349 INFO  
> (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_) 
> [n:127.0.0.1:57926_    ] o.a.s.c.OverseerCollectionProcessor Creating replica 
> shard testasynccollectioncreation_shard1_1_replica2 as part of slice shard1_1 
> of collection testasynccollectioncreation on 127.0.0.1:38702_
>    [junit4]   2> 390349 INFO  
> (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_) 
> [n:127.0.0.1:57926_    ] o.a.s.c.c.ZkStateReader Load collection config 
> from:/collections/testasynccollectioncreation
>    [junit4]   2> 390350 INFO  
> (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_) 
> [n:127.0.0.1:57926_    ] o.a.s.c.c.ZkStateReader 
> path=/collections/testasynccollectioncreation configName=conf1 specified 
> config exists in ZooKeeper
>    [junit4]   2> 390352 INFO  (qtp381614561-1631) [n:127.0.0.1:57926_    ] 
> o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/cores 
> params={nodeName=127.0.0.1:38702_&core=testasynccollectioncreation_shard1_0_replica1&async=10042599424132346542&qt=/admin/cores&coreNodeName=core_node5&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2}
>  status=0 QTime=2 
> {code}
> The following sequence of events leads to deadlock:
> # testasynccollectioncreation_shard1_0_replica2 (core_node5) becomes active
> # OCP asks sub-shard leader testasynccollectioncreation_shard1_0_replica1 to 
> wait until replica2 is in recovery
> Since replica2 is already active, it never transitions to the recovery state 
> again, so the wait can never succeed. At this point, the test just keeps 
> pinging for status until timeout and fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
