[ https://issues.apache.org/jira/browse/SOLR-9945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830097#comment-15830097 ]
Mark Miller commented on SOLR-9945: ----------------------------------- bq. first set the node's state to DOWN, It's a bug to do that, see my comment in SOLR-9555. > LIR should check the node is recovering before bring it down > ------------------------------------------------------------ > > Key: SOLR-9945 > URL: https://issues.apache.org/jira/browse/SOLR-9945 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Cao Manh Dat > Attachments: SOLR-9945.patch > > > When a node is recovering, the leader can meet an exception when trying to > send an update to the buffering node. So the leader will try running LIR > process: first set the node's state to DOWN, then send recovery OP to the > node. > In the same time, PrepRecoveryOp will make the leader wait for a very long > time to see the node's state is RECOVERING. > This scenario can easily be achieved by using this test > {code} > String collection = "collection2"; > CollectionAdminRequest > .createCollection(collection, "config", 1, 2) > .setMaxShardsPerNode(1) > .process(cluster.getSolrClient()); > AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, > cluster.getSolrClient().getZkStateReader(), > false, true, 30); > CloudSolrClient cloudClient = cluster.getSolrClient(); > DocCollection docCollection = > cloudClient.getZkStateReader().getClusterState().getCollection(collection); > Slice slice = docCollection.getSlice("shard1"); > Replica replicaNode = slice.getReplicas(replica -> replica != > slice.getLeader()).get(0); > JettySolrRunner replicaRunner = cluster.getReplicaJetty(replicaNode); > new UpdateRequest() > .add(sdoc("id", "1")) > .process(cloudClient, collection); > ChaosMonkey.stop(replicaRunner); > new UpdateRequest() > .add(sdoc("id", "2")) > .process(cloudClient, collection); > ChaosMonkey.start(replicaRunner); > new UpdateRequest() > .add(sdoc("id", "3")) > .process(cloudClient, collection); > AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, > cluster.getSolrClient().getZkStateReader(), > false, true, 60); > CollectionAdminRequest > .deleteCollection(collection) > .process(cloudClient); > {code} > Here are some log > {code} > 10942 INFO (searcherExecutor-53-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) > [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 > x:collection1_shard2_replica2] o.a.s.c.SolrCore [collection1_shard2_replica2] > Registered new searcher Searcher@510d1ea5[collection1_shard2_replica2] > main{ExitableDirectoryReader(UninvertingDirectoryReader())} > 10942 INFO (searcherExecutor-56-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node3) > [n:127.0.0.1:36014_solr c:collection1 s:shard1 r:core_node3 > x:collection1_shard1_replica2] o.a.s.c.SolrCore [collection1_shard1_replica2] > Registered new searcher Searcher@505d7839[collection1_shard1_replica2] > main{ExitableDirectoryReader(UninvertingDirectoryReader())} > 10958 INFO (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) > [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 > x:collection1_shard2_replica2] o.a.s.c.ZkController Core needs to > recover:collection1_shard2_replica2 > 10964 INFO (updateExecutor-20-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) > [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 > x:collection1_shard2_replica2] o.a.s.u.DefaultSolrCoreState Running recovery > 10970 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) > [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 > x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Starting recovery > process. recoveringAfterStartup=true > 10971 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) > [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 > x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy ###### > startupVersions=[[-1556034709770731520]] > 10971 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) > [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 > x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Begin buffering > updates. core=[collection1_shard2_replica2] > 10971 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) > [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 > x:collection1_shard2_replica2] o.a.s.u.UpdateLog Starting to buffer updates. > FSUpdateLog{state=ACTIVE, tlog=null} > 10971 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) > [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 > x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Publishing state of > core [collection1_shard2_replica2] as recovering, leader is > [http://127.0.0.1:42549/solr/collection1_shard2_replica1/] and I am > [http://127.0.0.1:36014/solr/collection1_shard2_replica2/] > 11129 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr > x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) > [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 > x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery > command to [http://127.0.0.1:42549/solr]; [WaitForState: > action=PREPRECOVERY&core=collection1_shard2_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node4&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true] > 11131 INFO (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) > [n:127.0.0.1:42549_solr ] o.a.s.c.c.ZkStateReader A cluster state change: > [WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/collection1/state.json] for collection [collection1] has > occurred - updating... (live nodes size: [2]) > 11131 INFO (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) > [n:127.0.0.1:36014_solr ] o.a.s.c.c.ZkStateReader A cluster state change: > [WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/collection1/state.json] for collection [collection1] has > occurred - updating... (live nodes size: [2]) > 11133 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] > o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node4, state: > recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true > 11140 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] > o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see > collection1_shard2_replica1 (shard2 of collection1) have state: recovering > 11140 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] > o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection1, > shard=shard2, thisCore=collection1_shard2_replica1, > leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, > currentState=recovering, localState=active, nodeName=127.0.0.1:36014_solr, > coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: > core_node4:{"core":"collection1_shard2_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"recovering"} > 11140 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] > o.a.s.h.a.PrepRecoveryOp Waited coreNodeName: core_node4, state: recovering, > checkLive: true, onlyIfLeader: true for: 0 seconds. > 11140 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] > o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores > params={nodeName=127.0.0.1:36014_solr&onlyIfLeaderActive=true&core=collection1_shard2_replica1&coreNodeName=core_node4&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2} > status=0 QTime=7 > 11160 INFO (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.s.SolrIndexSearcher Opening > [Searcher@28717b92[collection2_shard1_replica2] main] > 11161 INFO (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Configured > ZooKeeperStorageIO with znodeBase: /configs/config > 11162 INFO (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Loaded null at > path _rest_managed.json using ZooKeeperStorageIO:path=/configs/config > 11162 INFO (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.h.ReplicationHandler Commits will be > reserved for 10000 > 11162 INFO (searcherExecutor-54-thread-1-processing-n:127.0.0.1:36014_solr > x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.c.SolrCore [collection2_shard1_replica2] > Registered new searcher Searcher@28717b92[collection2_shard1_replica2] > main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(7.0.0):C1)))} > 11168 INFO (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr > x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.c.ZkController Core needs to > recover:collection2_shard1_replica2 > 11168 INFO (updateExecutor-20-thread-2-processing-n:127.0.0.1:36014_solr > x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.u.DefaultSolrCoreState Running recovery > 11168 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr > x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Starting recovery > process. recoveringAfterStartup=true > 11169 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr > x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy ###### > startupVersions=[[1556034713508904960]] > 11169 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr > x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Begin buffering > updates. core=[collection2_shard1_replica2] > 11169 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr > x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.u.UpdateLog Starting to buffer updates. > FSUpdateLog{state=ACTIVE, tlog=null} > 11169 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr > x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Publishing state of > core [collection2_shard1_replica2] as recovering, leader is > [http://127.0.0.1:42549/solr/collection2_shard1_replica1/] and I am > [http://127.0.0.1:36014/solr/collection2_shard1_replica2/] > 11171 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr > x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) > [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 > x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery > command to [http://127.0.0.1:42549/solr]; [WaitForState: > action=PREPRECOVERY&core=collection2_shard1_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node1&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true] > 11174 INFO (qtp344759431-37) [n:127.0.0.1:42549_solr ] > o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node1, state: > recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true > 11175 INFO (qtp344759431-37) [n:127.0.0.1:42549_solr ] > o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see > collection2_shard1_replica1 (shard1 of collection2) have state: recovering > 11175 INFO (qtp344759431-37) [n:127.0.0.1:42549_solr ] > o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection2, > shard=shard1, thisCore=collection2_shard1_replica1, > leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, > currentState=down, localState=active, nodeName=127.0.0.1:36014_solr, > coreNodeName=core_node1, onlyIfActiveCheckResult=false, nodeProps: > core_node1:{"core":"collection2_shard1_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"down"} > 11278 INFO (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) > [n:127.0.0.1:36014_solr ] o.a.s.c.c.ZkStateReader A cluster state change: > [WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/collection2/state.json] for collection [collection2] has > occurred - updating... (live nodes size: [2]) > 11278 INFO (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) > [n:127.0.0.1:42549_solr ] o.a.s.c.c.ZkStateReader A cluster state change: > [WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/collection2/state.json] for collection [collection2] has > occurred - updating... (live nodes size: [2]) > 11323 ERROR > (updateExecutor-7-thread-1-processing-http:////127.0.0.1:36014//solr//collection2_shard1_replica2 > x:collection2_shard1_replica1 r:core_node2 n:127.0.0.1:42549_solr s:shard1 > c:collection2) [n:127.0.0.1:42549_solr c:collection2 s:shard1 r:core_node2 > x:collection2_shard1_replica1] o.a.s.u.StreamingSolrClients error > java.net.SocketException: Broken pipe (Write failed) > at java.net.SocketOutputStream.socketWrite0(Native Method) > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) > at java.net.SocketOutputStream.write(SocketOutputStream.java:153) > at > org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:126) > at > org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:138) > at > org.apache.http.impl.io.SessionOutputBufferImpl.flush(SessionOutputBufferImpl.java:146) > at > org.apache.http.impl.io.ChunkedOutputStream.close(ChunkedOutputStream.java:205) > at > org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:159) > at > org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:162) > at > org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237) > at > org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122) > at > org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:61) > at > org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271) > at > org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) > at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) > at > org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) > at > org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) > at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) > at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) > at > org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:323) > at > org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:186) > at > com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) > at > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org