Shalin Shekhar Mangar created SOLR-7673:
-------------------------------------------
Summary: Race condition in shard splitting waiting for sub-shard
replicas to go into recovery
Key: SOLR-7673
URL: https://issues.apache.org/jira/browse/SOLR-7673
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 5.2, 4.10.4
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Fix For: 5.3, Trunk
I found this while digging into the failure on
https://builds.apache.org/job/Lucene-Solr-Tests-trunk-Java8/69/
{code}
[junit4] 2> 390032 INFO
(OverseerStateUpdate-93987739315601411-127.0.0.1:57926_-n_0000000000)
[n:127.0.0.1:57926_ ] o.a.s.c.o.ReplicaMutator Update state numShards=1
message={
[junit4] 2> "core":"testasynccollectioncreation_shard1_0_replica2",
[junit4] 2> "core_node_name":"core_node5",
[junit4] 2> "roles":null,
[junit4] 2> "base_url":"http://127.0.0.1:38702",
[junit4] 2> "node_name":"127.0.0.1:38702_",
[junit4] 2> "numShards":"1",
[junit4] 2> "state":"active",
[junit4] 2> "shard":"shard1_0",
[junit4] 2> "collection":"testasynccollectioncreation",
[junit4] 2> "operation":"state"}
[junit4] 2> 390033 INFO
(OverseerStateUpdate-93987739315601411-127.0.0.1:57926_-n_0000000000)
[n:127.0.0.1:57926_ ] o.a.s.c.o.ZkStateWriter going to update_collection
/collections/testasynccollectioncreation/state.json version: 18
[junit4] 2> 390033 INFO
(RecoveryThread-testasynccollectioncreation_shard1_0_replica2)
[n:127.0.0.1:38702_ c:testasynccollectioncreation s:shard1_0 r:core_node5
x:testasynccollectioncreation_shard1_0_replica2] o.a.s.u.UpdateLog Took 2 ms to
seed version buckets with highest version 1503803851028824064
[junit4] 2> 390033 INFO
(zkCallback-167-thread-1-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_
] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged
path:/collections/testasynccollectioncreation/state.json for collection
testasynccollectioncreation has occurred - updating... (live nodes size: 2)
[junit4] 2> 390033 INFO
(RecoveryThread-testasynccollectioncreation_shard1_0_replica2)
[n:127.0.0.1:38702_ c:testasynccollectioncreation s:shard1_0 r:core_node5
x:testasynccollectioncreation_shard1_0_replica2] o.a.s.c.RecoveryStrategy
Finished recovery process.
[junit4] 2> 390033 INFO
(zkCallback-173-thread-1-processing-n:127.0.0.1:38702_) [n:127.0.0.1:38702_
] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged
path:/collections/testasynccollectioncreation/state.json for collection
testasynccollectioncreation has occurred - updating... (live nodes size: 2)
[junit4] 2> 390034 INFO
(zkCallback-167-thread-1-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_
] o.a.s.c.c.ZkStateReader Updating data for testasynccollectioncreation to ver
19
[junit4] 2> 390034 INFO
(zkCallback-173-thread-1-processing-n:127.0.0.1:38702_) [n:127.0.0.1:38702_
] o.a.s.c.c.ZkStateReader Updating data for testasynccollectioncreation to ver
19
[junit4] 2> 390298 INFO (qtp1859109869-1664) [n:127.0.0.1:38702_ ]
o.a.s.h.a.CollectionsHandler Invoked Collection Action :requeststatus with
params={requestid=1004&action=REQUESTSTATUS&wt=javabin&version=2}
[junit4] 2> 390299 INFO (qtp1859109869-1664) [n:127.0.0.1:38702_ ]
o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/collections
params={requestid=1004&action=REQUESTSTATUS&wt=javabin&version=2} status=0
QTime=1
[junit4] 2> 390348 INFO (qtp1859109869-1665) [n:127.0.0.1:38702_ ]
o.a.s.h.a.CoreAdminHandler Checking request status for : 10042599422118342586
[junit4] 2> 390348 INFO (qtp1859109869-1665) [n:127.0.0.1:38702_ ]
o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/cores
params={qt=/admin/cores&requestid=10042599422118342586&action=REQUESTSTATUS&wt=javabin&version=2}
status=0 QTime=1
[junit4] 2> 390348 INFO
(OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_)
[n:127.0.0.1:57926_ ] o.a.s.c.OverseerCollectionProcessor Asking sub shard
leader to wait for: testasynccollectioncreation_shard1_0_replica2 to be alive
on: 127.0.0.1:38702_
[junit4] 2> 390349 INFO
(OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_)
[n:127.0.0.1:57926_ ] o.a.s.c.OverseerCollectionProcessor Creating replica
shard testasynccollectioncreation_shard1_1_replica2 as part of slice shard1_1
of collection testasynccollectioncreation on 127.0.0.1:38702_
[junit4] 2> 390349 INFO
(OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_)
[n:127.0.0.1:57926_ ] o.a.s.c.c.ZkStateReader Load collection config
from:/collections/testasynccollectioncreation
[junit4] 2> 390350 INFO
(OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_)
[n:127.0.0.1:57926_ ] o.a.s.c.c.ZkStateReader
path=/collections/testasynccollectioncreation configName=conf1 specified config
exists in ZooKeeper
[junit4] 2> 390352 INFO (qtp381614561-1631) [n:127.0.0.1:57926_ ]
o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/cores
params={nodeName=127.0.0.1:38702_&core=testasynccollectioncreation_shard1_0_replica1&async=10042599424132346542&qt=/admin/cores&coreNodeName=core_node5&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2}
status=0 QTime=2
{code}
The following sequence of events leads to a deadlock:
# testasynccollectioncreation_shard1_0_replica2 (core_node5) becomes active
# OCP asks the sub-shard leader testasynccollectioncreation_shard1_0_replica1
to wait (via PREPRECOVERY) until replica2 is in the recovering state
Since replica2 is already active, it will never publish a recovering state
again, so the PREPRECOVERY wait can never succeed. At this point, the test
just keeps polling REQUESTSTATUS until it times out and fails.
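The shape of the race can be sketched in a few lines. This is a hypothetical,
simplified stand-in (not actual Solr code): a wait loop polls for a transient
"recovering" state, but the replica has already moved on to "active" before the
wait even starts, so the loop can only time out.

```java
import java.util.concurrent.atomic.AtomicReference;

public class PrepRecoveryRace {
    // Stand-in for the replica state published in the cluster state.
    static final AtomicReference<String> replicaState = new AtomicReference<>("down");

    // Simplified analogue of the sub-shard leader's PREPRECOVERY wait:
    // poll until the replica reports the expected state or we time out.
    static boolean waitForState(String expected, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (expected.equals(replicaState.get())) {
                return true; // observed the expected state in time
            }
            Thread.sleep(10);
        }
        return false; // the transient state was missed; wait can never succeed
    }

    public static void main(String[] args) throws Exception {
        // The replica passes through "recovering" and reaches "active"
        // BEFORE the leader starts waiting -- mirroring the log above,
        // where core_node5 was already active when PREPRECOVERY was issued.
        replicaState.set("recovering");
        replicaState.set("active");

        boolean seen = waitForState("recovering", 200);
        System.out.println(seen
            ? "saw recovering"
            : "timed out waiting for recovering");
    }
}
```

Because the waiter only ever sees the *current* state, any fix has to either
start the wait before the replica can transition, or treat "already active" as
satisfying the wait.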
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]