[ https://issues.apache.org/jira/browse/SOLR-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar updated SOLR-6956:
----------------------------------------
    Attachment: DeleteReplicaTest-failure.log

The full test log from Jenkins is attached. The following is the interesting excerpt from the log:
{code}
[junit4] 2> 357522 T2350 oasco.ZkStateWriter.writePendingUpdates going to update_collection /collections/delLiveColl/state.json version: 10
[junit4] 2> 357523 T2349 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357523 T2450 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357523 T2425 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357523 T2381 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357523 T2450 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 11
[junit4] 2> 357524 T2425 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 11
[junit4] 2> 357524 T2381 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 11
[junit4] 2> 357559 T2321 oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Recoveries finished - collection: delLiveColl
[junit4] 2> 357562 T2372 oass.SolrDispatchFilter.handleAdminRequest [admin] webapp=null path=/admin/cores params={core=delLiveColl_shard1_replica1&action=STATUS&indexInfo=true&wt=javabin&version=2} status=0 QTime=1
[junit4] 2> 357565 T2339 oasha.CollectionsHandler.handleRemoveReplica Remove replica: replica=core_node1&action=deletereplica&collection=delLiveColl&shard=shard1&onlyIfDown=true&wt=javabin&version=2
[junit4] 2> 357566 T2364 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/collection-queue-work state SyncConnected
[junit4] 2> 357567 T2351 oasc.OverseerCollectionProcessor.run Overseer Collection Processor: Get the message id:/overseer/collection-queue-work/qn-0000000002 message:{
[junit4] 2>   "operation":"deletereplica",
[junit4] 2>   "collection":"delLiveColl",
[junit4] 2>   "shard":"shard1",
[junit4] 2>   "replica":"core_node1",
[junit4] 2>   "onlyIfDown":"true"}
[junit4] 2> 357567 T2489 oasc.OverseerCollectionProcessor.processMessage WARN OverseerCollectionProcessor.processMessage : deletereplica , {
[junit4] 2>   "operation":"deletereplica",
[junit4] 2>   "collection":"delLiveColl",
[junit4] 2>   "shard":"shard1",
[junit4] 2>   "replica":"core_node1",
[junit4] 2>   "onlyIfDown":"true"}
[junit4] 2> 357567 T2489 oasc.SolrException.log ERROR Collection: delLiveColl operation: deletereplica failed:org.apache.solr.common.SolrException: Attempted to remove replica : delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is 'recovering'
[junit4] 2> 	at org.apache.solr.cloud.OverseerCollectionProcessor.deleteReplica(OverseerCollectionProcessor.java:1079)
[junit4] 2> 	at org.apache.solr.cloud.OverseerCollectionProcessor.processMessage(OverseerCollectionProcessor.java:631)
[junit4] 2> 	at org.apache.solr.cloud.OverseerCollectionProcessor$Runner.run(OverseerCollectionProcessor.java:2862)
[junit4] 2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[junit4] 2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[junit4] 2> 	at java.lang.Thread.run(Thread.java:745)
[junit4] 2>
[junit4] 2> 357569 T2489 oasc.OverseerCollectionProcessor$Runner.run Overseer Collection Processor: Message id:/overseer/collection-queue-work/qn-0000000002 complete, response:{Operation deletereplica caused exception:=org.apache.solr.common.SolrException: Attempted to remove replica : delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is 'recovering',exception={msg=Attempted to remove replica : delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is 'recovering',rspCode=400}}
[junit4] 2> 357570 T2364 oasc.DistributedQueue$LatchWatcher.process NodeDataChanged fired on path /overseer/collection-queue-work/qnr-0000000002 state SyncConnected
[junit4] 2> 357570 T2364 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/collection-queue-work state SyncConnected
[junit4] 2> 357571 T2364 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/collection-queue-work state SyncConnected
[junit4] 2> 357572 T2339 oass.SolrDispatchFilter.handleAdminRequest [admin] webapp=null path=/admin/collections params={replica=core_node1&action=deletereplica&collection=delLiveColl&shard=shard1&onlyIfDown=true&wt=javabin&version=2} status=400 QTime=7
[junit4] 2> 357574 T2321 oas.SolrTestCaseJ4.tearDown ###Ending deleteLiveReplicaTest
[junit4] 2> 357577 T2321 oasc.ChaosMonkey.monkeyLog monkey: stop shard! 55369
[junit4] 2> 357577 T2321 oasc.CoreContainer.shutdown Shutting down CoreContainer instance=802937226
[junit4] 2> 357577 T2321 oasc.ZkController.publish publishing core=collection1 state=down collection=control_collection
[junit4] 2> 357577 T2321 oasc.ZkController.publish numShards not found on descriptor - reading it from system property
[junit4] 2> 357578 T2364 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/queue state SyncConnected
[junit4] 2> 357578 T2321 oasc.ZkController.publish publishing core=delLiveColl_shard1_replica2 state=down collection=delLiveColl
[junit4] 2> 357579 T2350 oasc.Overseer$ClusterStateUpdater.run processMessage: queueSize: 1, message = {
[junit4] 2>   "core":"collection1",
[junit4] 2>   "core_node_name":"core_node1",
[junit4] 2>   "roles":null,
[junit4] 2>   "base_url":"http://127.0.0.1:55369",
[junit4] 2>   "node_name":"127.0.0.1:55369_",
[junit4] 2>   "numShards":"2",
[junit4] 2>   "state":"down",
[junit4] 2>   "shard":"shard1",
[junit4] 2>   "collection":"control_collection",
[junit4] 2>   "operation":"state"} current state version: 6
[junit4] 2> 357579 T2321 oasc.SolrCore.close [collection1] CLOSING SolrCore org.apache.solr.core.SolrCore@6eb4277d
[junit4] 2> 357579 T2350 oasco.ReplicaMutator.updateState Update state numShards=2 message={
[junit4] 2>   "core":"collection1",
[junit4] 2>   "core_node_name":"core_node1",
[junit4] 2>   "roles":null,
[junit4] 2>   "base_url":"http://127.0.0.1:55369",
[junit4] 2>   "node_name":"127.0.0.1:55369_",
[junit4] 2>   "numShards":"2",
[junit4] 2>   "state":"down",
[junit4] 2>   "shard":"shard1",
[junit4] 2>   "collection":"control_collection",
[junit4] 2>   "operation":"state"}
[junit4] 2> 357579 T2321 oasc.ZkController.unRegisterConfListener a listener was removed because of core close
[junit4] 2> 357579 T2321 oasu.DirectUpdateHandler2.close closing DirectUpdateHandler2{commits=0,autocommits=0,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
[junit4] 2> 357579 T2321 oasu.SolrCoreState.decrefSolrCoreState Closing SolrCoreState
[junit4] 2> 357580 T2321 oasu.DefaultSolrCoreState.closeIndexWriter SolrCoreState ref count has reached 0 - closing IndexWriter
[junit4] 2> 357580 T2321 oasu.DefaultSolrCoreState.closeIndexWriter closing IndexWriter with IndexWriterCloser
[junit4] 2> 357580 T2364 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2381 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2425 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2450 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2403 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4] 2> 357580 T2321 oasc.SolrCore.closeSearcher [collection1] Closing main searcher on request.
[junit4] 2> 357586 T2350 oasc.Overseer$ClusterStateUpdater.run processMessage: queueSize: 1, message = {
[junit4] 2>   "core":"delLiveColl_shard1_replica2",
[junit4] 2>   "core_node_name":"core_node2",
[junit4] 2>   "roles":null,
[junit4] 2>   "base_url":"http://127.0.0.1:55369",
[junit4] 2>   "node_name":"127.0.0.1:55369_",
[junit4] 2>   "numShards":"2",
[junit4] 2>   "state":"down",
[junit4] 2>   "shard":"shard1",
[junit4] 2>   "collection":"delLiveColl",
[junit4] 2>   "operation":"state"} current state version: 7
[junit4] 2> 357587 T2350 oasco.ReplicaMutator.updateState Update state numShards=2 message={
[junit4] 2>   "core":"delLiveColl_shard1_replica2",
[junit4] 2>   "core_node_name":"core_node2",
[junit4] 2>   "roles":null,
[junit4] 2>   "base_url":"http://127.0.0.1:55369",
[junit4] 2>   "node_name":"127.0.0.1:55369_",
[junit4] 2>   "numShards":"2",
[junit4] 2>   "state":"down",
[junit4] 2>   "shard":"shard1",
[junit4] 2>   "collection":"delLiveColl",
[junit4] 2>   "operation":"state"}
[junit4] 2> 357589 T2490 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/queue state SyncConnected
[junit4] 2> 357589 T2350 oasco.ZkStateWriter.writePendingUpdates going to update_collection /collections/delLiveColl/state.json version: 11
[junit4] 2> 357589 T2425 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357589 T2450 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357589 T2381 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating...
[junit4] 2> 357590 T2425 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 12
[junit4] 2> 357590 T2381 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 12
[junit4] 2> 357590 T2450 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 12
[junit4] 2> 357593 T2349 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 12
{code}

In this run, the test has 4 nodes, and the collection being deleted has 2 shards and 4 replicas distributed across those 4 nodes. The overseer node itself hosts a replica of the collection being deleted. Notice that the Overseer writes the collection state, currently at version 10, and 4 watcher threads fire to update the state of 'delLiveColl' to version 11, but only three of those watchers reach the ZkStateReader.updateWatchedCollection method. Thread T2349 never executes updateWatchedCollection for version 11, presumably because it is blocked waiting for the update lock. I think this failure is linked to SOLR-6554, where my effort to make the overseer faster causes the update lock on the overseer node to be held for a long time, so state updates are not visible to anyone except the Overseer on that node. This causes the OverseerCollectionProcessor to operate on stale state, and it may well be the reason for various other failures in our cloud tests.

> DeleteReplicaTest fails sometimes.
> ----------------------------------
>
>                 Key: SOLR-6956
>                 URL: https://issues.apache.org/jira/browse/SOLR-6956
>             Project: Solr
>          Issue Type: Test
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: DeleteReplicaTest-failure.log
>
>
> I still see fails of this test sometimes:
> {noformat}
> java.lang.AssertionError: Should have had a good message here
> 	at __randomizedtesting.SeedInfo.seed([D765D9019AAF2D1E:56835719EDF04D22]:0)
> 	at org.junit.Assert.fail(Assert.java:93)
> 	at org.junit.Assert.assertTrue(Assert.java:43)
> 	at org.apache.solr.cloud.DeleteReplicaTest.deleteLiveReplicaTest(DeleteReplicaTest.java:138)
> 	at org.apache.solr.cloud.DeleteReplicaTest.doTest(DeleteReplicaTest.java:89)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
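An aside on the 400 response in the log excerpt: the deletereplica call was rejected because onlyIfDown='true' was requested while the replica was still 'recovering'. The guard has roughly the shape below. This is a simplified, hypothetical sketch (class and method names are illustrative, not the actual OverseerCollectionProcessor code); only the error message text is taken from the log.

```java
// Hypothetical sketch of the onlyIfDown guard that produced the 400 above.
// Names (OnlyIfDownGuard, checkDelete, State) are illustrative, not Solr code.
public class OnlyIfDownGuard {
    enum State { ACTIVE, RECOVERING, DOWN }

    // Rejects the delete when onlyIfDown is requested but the replica is not
    // down -- mirroring the SolrException message seen in the test log.
    static void checkDelete(String collection, String shard, String replica,
                            State state, boolean onlyIfDown) {
        if (onlyIfDown && state != State.DOWN) {
            throw new IllegalStateException(
                "Attempted to remove replica : " + collection + "/" + shard + "/"
                + replica + " with onlyIfDown='true', but state is '"
                + state.toString().toLowerCase() + "'");
        }
    }

    public static void main(String[] args) {
        try {
            // The replica is still recovering, as in the logged failure.
            checkDelete("delLiveColl", "shard1", "core_node1", State.RECOVERING, true);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```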
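The lock-contention theory in the comment can be illustrated outside of Solr with a minimal sketch (all names here, e.g. StaleStateDemo, are hypothetical and not Solr code): a watcher thread must take the same lock the overseer holds while it batches updates, so the watcher's cached state stays at the old version until the lock is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of the stale-read window: while the "overseer" thread holds
// the shared update lock, the "watcher" thread cannot refresh its cache.
public class StaleStateDemo {
    private final ReentrantLock updateLock = new ReentrantLock();
    private volatile int cachedVersion = 10;   // what the watcher last saw
    private volatile int zkVersion = 10;       // what is actually in ZK

    // Overseer: bump the ZK version, then keep holding the lock for a while.
    void overseerBatchUpdate(CountDownLatch holding, CountDownLatch release)
            throws InterruptedException {
        updateLock.lock();
        try {
            zkVersion = 11;          // state.json written at version 11
            holding.countDown();     // signal: lock is held, write is done
            release.await();         // simulate a long batch under the lock
        } finally {
            updateLock.unlock();
        }
    }

    // Watcher: cannot refresh the cached version until the lock is free.
    void watcherRefresh() {
        updateLock.lock();
        try {
            cachedVersion = zkVersion;
        } finally {
            updateLock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        StaleStateDemo demo = new StaleStateDemo();
        CountDownLatch holding = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread overseer = new Thread(() -> {
            try {
                demo.overseerBatchUpdate(holding, release);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        overseer.start();
        holding.await();             // wait until the overseer holds the lock

        Thread watcher = new Thread(demo::watcherRefresh);
        watcher.start();
        watcher.join(200);           // watcher is blocked; this join times out
        System.out.println("while lock held, cached=" + demo.cachedVersion);

        release.countDown();         // overseer finishes its batch
        overseer.join();
        watcher.join();              // watcher finally acquires the lock
        System.out.println("after release, cached=" + demo.cachedVersion);
    }
}
```

Under these assumptions the cached version stays at 10 for as long as the overseer holds the lock, which is exactly the window in which anything reading the cache on that node would act on stale state.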