[ 
https://issues.apache.org/jira/browse/SOLR-9945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830097#comment-15830097
 ] 

Mark Miller commented on SOLR-9945:
-----------------------------------

bq. first set the node's state to DOWN,

It's a bug to do that, see my comment in SOLR-9555.

> LIR should check the node is recovering before bring it down
> ------------------------------------------------------------
>
>                 Key: SOLR-9945
>                 URL: https://issues.apache.org/jira/browse/SOLR-9945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>         Attachments: SOLR-9945.patch
>
>
> When a node is recovering, the leader can meet an exception when trying to 
> send an update to the buffering node. So the leader will try running LIR 
> process: first set the node's state to DOWN, then send recovery OP to the 
> node.
> In the same time, PrepRecoveryOp will make the leader wait for a very long 
> time to see the node's state is RECOVERING. 
> This scenario can easily be achieved by using this test
> {code}
> String collection = "collection2";
> CollectionAdminRequest
>     .createCollection(collection, "config", 1, 2)
>     .setMaxShardsPerNode(1)
>     .process(cluster.getSolrClient());
> AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, 
> cluster.getSolrClient().getZkStateReader(),
>     false, true, 30);
> CloudSolrClient cloudClient = cluster.getSolrClient();
> DocCollection docCollection = 
> cloudClient.getZkStateReader().getClusterState().getCollection(collection);
> Slice slice = docCollection.getSlice("shard1");
> Replica replicaNode = slice.getReplicas(replica -> replica != 
> slice.getLeader()).get(0);
> JettySolrRunner replicaRunner = cluster.getReplicaJetty(replicaNode);
> new UpdateRequest()
>     .add(sdoc("id", "1"))
>     .process(cloudClient, collection);
> ChaosMonkey.stop(replicaRunner);
> new UpdateRequest()
>     .add(sdoc("id", "2"))
>     .process(cloudClient, collection);
> ChaosMonkey.start(replicaRunner);
> new UpdateRequest()
>     .add(sdoc("id", "3"))
>     .process(cloudClient, collection);
> AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, 
> cluster.getSolrClient().getZkStateReader(),
>     false, true, 60);
> CollectionAdminRequest
>     .deleteCollection(collection)
>     .process(cloudClient);  
> {code}
> Here are some log 
> {code}
> 10942 INFO  (searcherExecutor-53-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 
> x:collection1_shard2_replica2] o.a.s.c.SolrCore [collection1_shard2_replica2] 
> Registered new searcher Searcher@510d1ea5[collection1_shard2_replica2] 
> main{ExitableDirectoryReader(UninvertingDirectoryReader())}
> 10942 INFO  (searcherExecutor-56-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node3) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard1 r:core_node3 
> x:collection1_shard1_replica2] o.a.s.c.SolrCore [collection1_shard1_replica2] 
> Registered new searcher Searcher@505d7839[collection1_shard1_replica2] 
> main{ExitableDirectoryReader(UninvertingDirectoryReader())}
> 10958 INFO  (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 
> x:collection1_shard2_replica2] o.a.s.c.ZkController Core needs to 
> recover:collection1_shard2_replica2
> 10964 INFO  (updateExecutor-20-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 
> x:collection1_shard2_replica2] o.a.s.u.DefaultSolrCoreState Running recovery
> 10970 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 
> x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Starting recovery 
> process. recoveringAfterStartup=true
> 10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 
> x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy ###### 
> startupVersions=[[-1556034709770731520]]
> 10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 
> x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Begin buffering 
> updates. core=[collection1_shard2_replica2]
> 10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 
> x:collection1_shard2_replica2] o.a.s.u.UpdateLog Starting to buffer updates. 
> FSUpdateLog{state=ACTIVE, tlog=null}
> 10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 
> x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Publishing state of 
> core [collection1_shard2_replica2] as recovering, leader is 
> [http://127.0.0.1:42549/solr/collection1_shard2_replica1/] and I am 
> [http://127.0.0.1:36014/solr/collection1_shard2_replica2/]
> 11129 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) 
> [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 
> x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery 
> command to [http://127.0.0.1:42549/solr]; [WaitForState: 
> action=PREPRECOVERY&core=collection1_shard2_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node4&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
> 11131 INFO  (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) 
> [n:127.0.0.1:42549_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: 
> [WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/collections/collection1/state.json] for collection [collection1] has 
> occurred - updating... (live nodes size: [2])
> 11131 INFO  (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) 
> [n:127.0.0.1:36014_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: 
> [WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/collections/collection1/state.json] for collection [collection1] has 
> occurred - updating... (live nodes size: [2])
> 11133 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] 
> o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node4, state: 
> recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true
> 11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] 
> o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see 
> collection1_shard2_replica1 (shard2 of collection1) have state: recovering
> 11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] 
> o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection1, 
> shard=shard2, thisCore=collection1_shard2_replica1, 
> leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, 
> currentState=recovering, localState=active, nodeName=127.0.0.1:36014_solr, 
> coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: 
> core_node4:{"core":"collection1_shard2_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"recovering"}
> 11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] 
> o.a.s.h.a.PrepRecoveryOp Waited coreNodeName: core_node4, state: recovering, 
> checkLive: true, onlyIfLeader: true for: 0 seconds.
> 11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] 
> o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores 
> params={nodeName=127.0.0.1:36014_solr&onlyIfLeaderActive=true&core=collection1_shard2_replica1&coreNodeName=core_node4&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2}
>  status=0 QTime=7
> 11160 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.s.SolrIndexSearcher Opening 
> [Searcher@28717b92[collection2_shard1_replica2] main]
> 11161 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Configured 
> ZooKeeperStorageIO with znodeBase: /configs/config
> 11162 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Loaded null at 
> path _rest_managed.json using ZooKeeperStorageIO:path=/configs/config
> 11162 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.h.ReplicationHandler Commits will be 
> reserved for  10000
> 11162 INFO  (searcherExecutor-54-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.c.SolrCore [collection2_shard1_replica2] 
> Registered new searcher Searcher@28717b92[collection2_shard1_replica2] 
> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(7.0.0):C1)))}
> 11168 INFO  (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr 
> x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.c.ZkController Core needs to 
> recover:collection2_shard1_replica2
> 11168 INFO  (updateExecutor-20-thread-2-processing-n:127.0.0.1:36014_solr 
> x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.u.DefaultSolrCoreState Running recovery
> 11168 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr 
> x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Starting recovery 
> process. recoveringAfterStartup=true
> 11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr 
> x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy ###### 
> startupVersions=[[1556034713508904960]]
> 11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr 
> x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Begin buffering 
> updates. core=[collection2_shard1_replica2]
> 11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr 
> x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.u.UpdateLog Starting to buffer updates. 
> FSUpdateLog{state=ACTIVE, tlog=null}
> 11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr 
> x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Publishing state of 
> core [collection2_shard1_replica2] as recovering, leader is 
> [http://127.0.0.1:42549/solr/collection2_shard1_replica1/] and I am 
> [http://127.0.0.1:36014/solr/collection2_shard1_replica2/]
> 11171 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr 
> x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) 
> [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 
> x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery 
> command to [http://127.0.0.1:42549/solr]; [WaitForState: 
> action=PREPRECOVERY&core=collection2_shard1_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node1&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
> 11174 INFO  (qtp344759431-37) [n:127.0.0.1:42549_solr    ] 
> o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node1, state: 
> recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true
> 11175 INFO  (qtp344759431-37) [n:127.0.0.1:42549_solr    ] 
> o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see 
> collection2_shard1_replica1 (shard1 of collection2) have state: recovering
> 11175 INFO  (qtp344759431-37) [n:127.0.0.1:42549_solr    ] 
> o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection2, 
> shard=shard1, thisCore=collection2_shard1_replica1, 
> leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, 
> currentState=down, localState=active, nodeName=127.0.0.1:36014_solr, 
> coreNodeName=core_node1, onlyIfActiveCheckResult=false, nodeProps: 
> core_node1:{"core":"collection2_shard1_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"down"}
> 11278 INFO  (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) 
> [n:127.0.0.1:36014_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: 
> [WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/collections/collection2/state.json] for collection [collection2] has 
> occurred - updating... (live nodes size: [2])
> 11278 INFO  (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) 
> [n:127.0.0.1:42549_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: 
> [WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/collections/collection2/state.json] for collection [collection2] has 
> occurred - updating... (live nodes size: [2])
> 11323 ERROR 
> (updateExecutor-7-thread-1-processing-http:////127.0.0.1:36014//solr//collection2_shard1_replica2
>  x:collection2_shard1_replica1 r:core_node2 n:127.0.0.1:42549_solr s:shard1 
> c:collection2) [n:127.0.0.1:42549_solr c:collection2 s:shard1 r:core_node2 
> x:collection2_shard1_replica1] o.a.s.u.StreamingSolrClients error
> java.net.SocketException: Broken pipe (Write failed)
>       at java.net.SocketOutputStream.socketWrite0(Native Method)
>       at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>       at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>       at 
> org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:126)
>       at 
> org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:138)
>       at 
> org.apache.http.impl.io.SessionOutputBufferImpl.flush(SessionOutputBufferImpl.java:146)
>       at 
> org.apache.http.impl.io.ChunkedOutputStream.close(ChunkedOutputStream.java:205)
>       at 
> org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:159)
>       at 
> org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:162)
>       at 
> org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237)
>       at 
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122)
>       at 
> org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:61)
>       at 
> org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
>       at 
> org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
>       at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
>       at 
> org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
>       at 
> org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
>       at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>       at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
>       at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:323)
>       at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:186)
>       at 
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>       at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to