[ 
https://issues.apache.org/jira/browse/SOLR-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-6956:
----------------------------------------
    Attachment: DeleteReplicaTest-failure.log

The full test log from jenkins is attached.

The following is the interesting excerpt from the log:
{code}
[junit4]   2> 357522 T2350 oasco.ZkStateWriter.writePendingUpdates going to 
update_collection /collections/delLiveColl/state.json version: 10
   [junit4]   2> 357523 T2349 oascc.ZkStateReader$7.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/delLiveColl/state.json, has occurred - updating... 
   [junit4]   2> 357523 T2450 oascc.ZkStateReader$7.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/delLiveColl/state.json, has occurred - updating... 
   [junit4]   2> 357523 T2425 oascc.ZkStateReader$7.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/delLiveColl/state.json, has occurred - updating... 
   [junit4]   2> 357523 T2381 oascc.ZkStateReader$7.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/delLiveColl/state.json, has occurred - updating... 
   [junit4]   2> 357523 T2450 oascc.ZkStateReader.updateWatchedCollection 
Updating data for delLiveColl to ver 11 
   [junit4]   2> 357524 T2425 oascc.ZkStateReader.updateWatchedCollection 
Updating data for delLiveColl to ver 11 
   [junit4]   2> 357524 T2381 oascc.ZkStateReader.updateWatchedCollection 
Updating data for delLiveColl to ver 11 
   [junit4]   2> 357559 T2321 
oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Recoveries finished - 
collection: delLiveColl
   [junit4]   2> 357562 T2372 oass.SolrDispatchFilter.handleAdminRequest 
[admin] webapp=null path=/admin/cores 
params={core=delLiveColl_shard1_replica1&action=STATUS&indexInfo=true&wt=javabin&version=2}
 status=0 QTime=1 
   [junit4]   2> 357565 T2339 oasha.CollectionsHandler.handleRemoveReplica 
Remove replica: 
replica=core_node1&action=deletereplica&collection=delLiveColl&shard=shard1&onlyIfDown=true&wt=javabin&version=2
   [junit4]   2> 357566 T2364 oasc.DistributedQueue$LatchWatcher.process 
NodeChildrenChanged fired on path /overseer/collection-queue-work state 
SyncConnected
   [junit4]   2> 357567 T2351 oasc.OverseerCollectionProcessor.run Overseer 
Collection Processor: Get the message 
id:/overseer/collection-queue-work/qn-0000000002 message:{
   [junit4]   2>          "operation":"deletereplica",
   [junit4]   2>          "collection":"delLiveColl",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "replica":"core_node1",
   [junit4]   2>          "onlyIfDown":"true"}
   [junit4]   2> 357567 T2489 oasc.OverseerCollectionProcessor.processMessage 
WARN OverseerCollectionProcessor.processMessage : deletereplica , {
   [junit4]   2>          "operation":"deletereplica",
   [junit4]   2>          "collection":"delLiveColl",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "replica":"core_node1",
   [junit4]   2>          "onlyIfDown":"true"}
   [junit4]   2> 357567 T2489 oasc.SolrException.log ERROR Collection: 
delLiveColl operation: deletereplica 
failed:org.apache.solr.common.SolrException: Attempted to remove replica : 
delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is 'recovering'
   [junit4]   2>                at 
org.apache.solr.cloud.OverseerCollectionProcessor.deleteReplica(OverseerCollectionProcessor.java:1079)
   [junit4]   2>                at 
org.apache.solr.cloud.OverseerCollectionProcessor.processMessage(OverseerCollectionProcessor.java:631)
   [junit4]   2>                at 
org.apache.solr.cloud.OverseerCollectionProcessor$Runner.run(OverseerCollectionProcessor.java:2862)
   [junit4]   2>                at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   [junit4]   2>                at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   [junit4]   2>                at java.lang.Thread.run(Thread.java:745)
   [junit4]   2>        
   [junit4]   2> 357569 T2489 oasc.OverseerCollectionProcessor$Runner.run 
Overseer Collection Processor: Message 
id:/overseer/collection-queue-work/qn-0000000002 complete, response:{Operation 
deletereplica caused exception:=org.apache.solr.common.SolrException: Attempted 
to remove replica : delLiveColl/shard1/core_node1 with onlyIfDown='true', but 
state is 'recovering',exception={msg=Attempted to remove replica : 
delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is 
'recovering',rspCode=400}}
   [junit4]   2> 357570 T2364 oasc.DistributedQueue$LatchWatcher.process 
NodeDataChanged fired on path /overseer/collection-queue-work/qnr-0000000002 
state SyncConnected
   [junit4]   2> 357570 T2364 oasc.DistributedQueue$LatchWatcher.process 
NodeChildrenChanged fired on path /overseer/collection-queue-work state 
SyncConnected
   [junit4]   2> 357571 T2364 oasc.DistributedQueue$LatchWatcher.process 
NodeChildrenChanged fired on path /overseer/collection-queue-work state 
SyncConnected
   [junit4]   2> 357572 T2339 oass.SolrDispatchFilter.handleAdminRequest 
[admin] webapp=null path=/admin/collections 
params={replica=core_node1&action=deletereplica&collection=delLiveColl&shard=shard1&onlyIfDown=true&wt=javabin&version=2}
 status=400 QTime=7 
   [junit4]   2> 357574 T2321 oas.SolrTestCaseJ4.tearDown ###Ending 
deleteLiveReplicaTest
   [junit4]   2> 357577 T2321 oasc.ChaosMonkey.monkeyLog monkey: stop shard! 
55369
   [junit4]   2> 357577 T2321 oasc.CoreContainer.shutdown Shutting down 
CoreContainer instance=802937226
   [junit4]   2> 357577 T2321 oasc.ZkController.publish publishing 
core=collection1 state=down collection=control_collection
   [junit4]   2> 357577 T2321 oasc.ZkController.publish numShards not found on 
descriptor - reading it from system property
   [junit4]   2> 357578 T2364 oasc.DistributedQueue$LatchWatcher.process 
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
   [junit4]   2> 357578 T2321 oasc.ZkController.publish publishing 
core=delLiveColl_shard1_replica2 state=down collection=delLiveColl
   [junit4]   2> 357579 T2350 oasc.Overseer$ClusterStateUpdater.run 
processMessage: queueSize: 1, message = {
   [junit4]   2>          "core":"collection1",
   [junit4]   2>          "core_node_name":"core_node1",
   [junit4]   2>          "roles":null,
   [junit4]   2>          "base_url":"http://127.0.0.1:55369";,
   [junit4]   2>          "node_name":"127.0.0.1:55369_",
   [junit4]   2>          "numShards":"2",
   [junit4]   2>          "state":"down",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "collection":"control_collection",
   [junit4]   2>          "operation":"state"} current state version: 6
   [junit4]   2> 357579 T2321 oasc.SolrCore.close [collection1]  CLOSING 
SolrCore org.apache.solr.core.SolrCore@6eb4277d
   [junit4]   2> 357579 T2350 oasco.ReplicaMutator.updateState Update state 
numShards=2 message={
   [junit4]   2>          "core":"collection1",
   [junit4]   2>          "core_node_name":"core_node1",
   [junit4]   2>          "roles":null,
   [junit4]   2>          "base_url":"http://127.0.0.1:55369";,
   [junit4]   2>          "node_name":"127.0.0.1:55369_",
   [junit4]   2>          "numShards":"2",
   [junit4]   2>          "state":"down",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "collection":"control_collection",
   [junit4]   2>          "operation":"state"}
   [junit4]   2> 357579 T2321 oasc.ZkController.unRegisterConfListener  a 
listener was removed because of core close
   [junit4]   2> 357579 T2321 oasu.DirectUpdateHandler2.close closing 
DirectUpdateHandler2{commits=0,autocommits=0,soft 
autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
   [junit4]   2> 357579 T2321 oasu.SolrCoreState.decrefSolrCoreState Closing 
SolrCoreState
   [junit4]   2> 357580 T2321 oasu.DefaultSolrCoreState.closeIndexWriter 
SolrCoreState ref count has reached 0 - closing IndexWriter
   [junit4]   2> 357580 T2321 oasu.DefaultSolrCoreState.closeIndexWriter 
closing IndexWriter with IndexWriterCloser
   [junit4]   2> 357580 T2364 oascc.ZkStateReader$2.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
   [junit4]   2> 357580 T2381 oascc.ZkStateReader$2.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
   [junit4]   2> 357580 T2425 oascc.ZkStateReader$2.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
   [junit4]   2> 357580 T2450 oascc.ZkStateReader$2.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
   [junit4]   2> 357580 T2403 oascc.ZkStateReader$2.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
   [junit4]   2> 357580 T2321 oasc.SolrCore.closeSearcher [collection1] Closing 
main searcher on request.
   [junit4]   2> 357586 T2350 oasc.Overseer$ClusterStateUpdater.run 
processMessage: queueSize: 1, message = {
   [junit4]   2>          "core":"delLiveColl_shard1_replica2",
   [junit4]   2>          "core_node_name":"core_node2",
   [junit4]   2>          "roles":null,
   [junit4]   2>          "base_url":"http://127.0.0.1:55369";,
   [junit4]   2>          "node_name":"127.0.0.1:55369_",
   [junit4]   2>          "numShards":"2",
   [junit4]   2>          "state":"down",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "collection":"delLiveColl",
   [junit4]   2>          "operation":"state"} current state version: 7
   [junit4]   2> 357587 T2350 oasco.ReplicaMutator.updateState Update state 
numShards=2 message={
   [junit4]   2>          "core":"delLiveColl_shard1_replica2",
   [junit4]   2>          "core_node_name":"core_node2",
   [junit4]   2>          "roles":null,
   [junit4]   2>          "base_url":"http://127.0.0.1:55369";,
   [junit4]   2>          "node_name":"127.0.0.1:55369_",
   [junit4]   2>          "numShards":"2",
   [junit4]   2>          "state":"down",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "collection":"delLiveColl",
   [junit4]   2>          "operation":"state"}
   [junit4]   2> 357589 T2490 oasc.DistributedQueue$LatchWatcher.process 
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
   [junit4]   2> 357589 T2350 oasco.ZkStateWriter.writePendingUpdates going to 
update_collection /collections/delLiveColl/state.json version: 11
   [junit4]   2> 357589 T2425 oascc.ZkStateReader$7.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/delLiveColl/state.json, has occurred - updating... 
   [junit4]   2> 357589 T2450 oascc.ZkStateReader$7.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/delLiveColl/state.json, has occurred - updating... 
   [junit4]   2> 357589 T2381 oascc.ZkStateReader$7.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/delLiveColl/state.json, has occurred - updating... 
   [junit4]   2> 357590 T2425 oascc.ZkStateReader.updateWatchedCollection 
Updating data for delLiveColl to ver 12 
   [junit4]   2> 357590 T2381 oascc.ZkStateReader.updateWatchedCollection 
Updating data for delLiveColl to ver 12 
   [junit4]   2> 357590 T2450 oascc.ZkStateReader.updateWatchedCollection 
Updating data for delLiveColl to ver 12 
   [junit4]   2> 357593 T2349 oascc.ZkStateReader.updateWatchedCollection 
Updating data for delLiveColl to ver 12 
{code}

In this run, the test has 4 nodes and the collection being deleted has 2 shards 
and 4 replicas distributed across those 4 nodes. The overseer node itself hosts 
a replica of the collection being deleted. 

Notice that the Overseer updates the collection state currently at version 10 
and 4 watcher threads fire to update state of 'delLiveColl' to version 11. But 
the ZkStateReader.updateWatchedCollection method is only fired by three 
watchers. The thread T2349 doesn't execute the updateWatchedCollection method 
presumably because it is waiting for the update lock.

I think this failure is linked to SOLR-6554 where my effort in making the 
overseer faster causes the update lock to be held up for a long time on the 
overseer node causing state updates to be not visible to anyone except the 
Overseer on that node. This is causing the OverseerCollectionProcessor to 
operate on stale state. This may likely be the reason for various other 
failures in our cloud tests.

> DeleteReplicaTest fails sometimes.
> ----------------------------------
>
>                 Key: SOLR-6956
>                 URL: https://issues.apache.org/jira/browse/SOLR-6956
>             Project: Solr
>          Issue Type: Test
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: DeleteReplicaTest-failure.log
>
>
> I still see fails of this test sometimes:
> {noformat}
> java.lang.AssertionError: Should have had a good message here
>       at 
> __randomizedtesting.SeedInfo.seed([D765D9019AAF2D1E:56835719EDF04D22]:0)
>       at org.junit.Assert.fail(Assert.java:93)
>       at org.junit.Assert.assertTrue(Assert.java:43)
>       at 
> org.apache.solr.cloud.DeleteReplicaTest.deleteLiveReplicaTest(DeleteReplicaTest.java:138)
>       at 
> org.apache.solr.cloud.DeleteReplicaTest.doTest(DeleteReplicaTest.java:89)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to