[ 
https://issues.apache.org/jira/browse/SOLR-9366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403321#comment-15403321
 ] 

Shalin Shekhar Mangar commented on SOLR-9366:
---------------------------------------------

Yes, this is to be expected. The reason is that the queries execute before the 
new state (i.e. 'down') is propagated to all the nodes.

You can see this in the logs (the following excerpt is for shard5 only):
{code}
   [junit4]   2> 96432 INFO  
(updateExecutor-29-thread-10-processing-x:test_col_shard5_replica2 r:core_node1 
http:////127.0.0.1:35941//solr//test_col_shard5_replica1// 
n:127.0.0.1:36018_solr s:shard5 c:test_col) [n:127.0.0.1:36018_solr c:test_col 
s:shard5 r:core_node1 x:test_col_shard5_replica2] 
o.a.s.c.LeaderInitiatedRecoveryThread Asking core=test_col_shard5_replica1 
coreNodeName=core_node6 on http://127.0.0.1:35941/solr to recover
....
   [junit4]   2> 96458 INFO  (qtp1947867562-51) [n:127.0.0.1:35941_solr    ] 
o.a.s.h.a.CoreAdminOperation It has been requested that we recover: 
core=test_col_shard5_replica1
....
   [junit4]   2> 96510 INFO  
(OverseerStateUpdate-96343095816486940-127.0.0.1:36018_solr-n_0000000000) 
[n:127.0.0.1:36018_solr    ] o.a.s.c.o.ReplicaMutator Update state numShards=5 
message={
   [junit4]   2>   "core":"test_col_shard5_replica1",
   [junit4]   2>   "core_node_name":"core_node6",
   [junit4]   2>   "roles":null,
   [junit4]   2>   "base_url":"http://127.0.0.1:35941/solr";,
   [junit4]   2>   "node_name":"127.0.0.1:35941_solr",
   [junit4]   2>   "numShards":"5",
   [junit4]   2>   "state":"recovering",
   [junit4]   2>   "shard":"shard5",
   [junit4]   2>   "collection":"test_col",
   [junit4]   2>   "operation":"state"}
....
   [junit4]   2> 96527 INFO  (qtp623658064-525) [n:127.0.0.1:36018_solr    ] 
o.a.s.h.a.CoreAdminOperation Will wait a max of 183 seconds to see 
test_col_shard5_replica2 (shard5 of test_col) have state: recovering
....
   [junit4]   2> 96713 INFO  
(OverseerStateUpdate-96343095816486940-127.0.0.1:36018_solr-n_0000000000) 
[n:127.0.0.1:36018_solr    ] o.a.s.c.o.ZkStateWriter going to update_collection 
/collections/test_col/state.json version: 14
....
[junit4]   2> 96715 INFO  
(zkCallback-45-thread-2-processing-n:127.0.0.1:50832_solr) 
[n:127.0.0.1:50832_solr    ] o.a.s.c.c.ZkStateReader Updating data for 
[test_col] from [14] to [15]
[junit4]   2> 96715 INFO  
(zkCallback-50-thread-5-processing-n:127.0.0.1:36018_solr) 
[n:127.0.0.1:36018_solr    ] o.a.s.c.c.ZkStateReader Updating data for 
[test_col] from [14] to [15]
{code}

# Time: 96432 -- The shard5 leader asks replica shard5_replica1 to recover.
# Time: 96458 -- The replica shard5_replica1 receives the recovery request. At 
this point, it internally marks itself as recovering.
# Time: 96510 -- The message to publish shard5_replica1 as 'down' is received 
at the overseer.
# Time: 96527 -- The shard5 leader, i.e. shard5_replica2, is still waiting to 
see shard5_replica1 as down. Note that around this time the query is being 
executed.
# Time: 96713 -- The overseer actually writes the new collection state to 
ZooKeeper. The delay is because the overseer buffers writes to ZK. The query 
has already executed by 96534.
# Time: 96715 -- The new collection state finally becomes visible on the 
relevant nodes via their ZooKeeper watches on state.json (a minimal watcher 
sketch follows this list).
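
The 96713 -> 96715 gap is just the ZooKeeper watch round trip: each node's 
ZkStateReader keeps a watch on /collections/test_col/state.json and re-reads it 
when the overseer writes a new version. Purely to illustrate that mechanism 
(not part of the test), here is a minimal watcher sketch using the plain 
ZooKeeper client; the ZK address is a placeholder for the cluster's ensemble.
{code:java}
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StateJsonWatch {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK address; in the test it would be the embedded ZK of the mini cluster.
    ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> {});
    Stat stat = new Stat();
    // Read state.json and leave a one-shot watch; it fires when the overseer
    // writes a new version (e.g. the 14 -> 15 update seen in the log above).
    zk.getData("/collections/test_col/state.json", event -> {
      if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
        System.out.println("state.json changed: " + event.getPath());
      }
    }, stat);
    System.out.println("current state.json version: " + stat.getVersion());
    Thread.sleep(60000); // keep the process alive long enough to observe an update
    zk.close();
  }
}
{code}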

In summary, state updates cannot be instantaneously propagated to the entire 
cluster and this information asymmetry will always exist unless we implement 
consensus for each update. Also note that this is a *temporary* issue because 
the replica in recovery will not be selected for shard requests once the state 
is propagated across the cluster.
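
Since the issue is temporary, a client or test that needs an accurate count can 
simply wait for the cluster state to settle before querying. A minimal sketch 
with SolrJ (it assumes an already connected CloudSolrClient; the class, method 
name and timeout handling are illustrative, not an existing API):
{code:java}
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class WaitForActiveReplicas {
  // Poll the cluster state until every replica of the collection is ACTIVE and
  // hosted on a live node, or give up after the timeout.
  static void waitForAllActive(CloudSolrClient cloud, String collection, long timeoutSec)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSec);
    while (System.nanoTime() < deadline) {
      ClusterState cs = cloud.getZkStateReader().getClusterState();
      Set<String> liveNodes = cs.getLiveNodes();
      DocCollection coll = cs.getCollection(collection);
      boolean allActive = true;
      for (Slice slice : coll.getActiveSlices()) {
        for (Replica replica : slice.getReplicas()) {
          if (replica.getState() != Replica.State.ACTIVE
              || !liveNodes.contains(replica.getNodeName())) {
            allActive = false;
          }
        }
      }
      if (allActive) return;
      Thread.sleep(250); // state arrives via ZK watches, so a short poll interval is enough
    }
    throw new IllegalStateException("Replicas of " + collection + " did not become active in time");
  }
}
{code}
This is essentially what the test framework's wait-for-recoveries helpers do 
(option 1 below).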

It is a bad idea to use Solr as a consistent data store the way this test does. 
The test can do one of the following:
# Wait for recovery to complete before asserting doc counts (as in the sketch 
above),
# Use min_rf with updates and assert that all documents which made it to both 
replicas are returned in the query results, or
# Query each shard leader individually (distrib=false) and sum up the numFound 
on the client (see the sketch after this list).
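
For option 3, a minimal SolrJ sketch (class and method names are illustrative; 
it assumes the cluster state read from ZK is fresh enough to identify the 
current leaders):
{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class LeaderOnlyCount {
  // Query each shard leader directly with distrib=false and sum numFound on the client.
  static long countViaLeaders(CloudSolrClient cloud, String collection) throws Exception {
    DocCollection coll = cloud.getZkStateReader().getClusterState().getCollection(collection);
    long total = 0;
    for (Slice slice : coll.getActiveSlices()) {
      Replica leader = slice.getLeader();
      // base_url + core name gives the leader's core URL,
      // e.g. http://127.0.0.1:35941/solr/test_col_shard5_replica1
      String coreUrl = leader.getStr(ZkStateReader.BASE_URL_PROP)
          + "/" + leader.getStr(ZkStateReader.CORE_NAME_PROP);
      try (HttpSolrClient direct = new HttpSolrClient.Builder(coreUrl).build()) {
        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", "false"); // do not fan out; count only this core
        q.setRows(0);
        total += direct.query(q).getResults().getNumFound();
      }
    }
    return total;
  }
}
{code}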

That being said, we can improve a few things:
# If the query happens to be aggregated by the leader, it can take LIR into 
account while selecting a replica for the non-distrib request.
# The overseer can help by flushing 'down' state updates to ZK faster instead 
of buffering them.
# Provide a way to query leaders only for full consistency (a client-side 
approximation is sketched after this list).
# Use LIR information from ZK in addition to live_nodes to select a replica for 
the non-distrib request -- I am not sure we should do this. If we do, we should 
make sure we do not hit ZK in the fast path.
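
For the third point, a rough client-side approximation that already works today 
is to pin the distributed request to the current leaders via the explicit 
shards parameter. Again a sketch, not a definitive implementation: leadership 
can still change between reading the state and executing the query.
{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class LeadersOnlyQuery {
  // Restrict the fan-out to the leader cores by listing their URLs in 'shards',
  // instead of letting each node pick any replica it still believes is active.
  static QueryResponse queryLeadersOnly(CloudSolrClient cloud, String collection, SolrQuery q)
      throws Exception {
    DocCollection coll = cloud.getZkStateReader().getClusterState().getCollection(collection);
    List<String> leaderUrls = new ArrayList<>();
    for (Slice slice : coll.getActiveSlices()) {
      leaderUrls.add(slice.getLeader().getStr(ZkStateReader.BASE_URL_PROP)
          + "/" + slice.getLeader().getStr(ZkStateReader.CORE_NAME_PROP));
    }
    q.set("shards", String.join(",", leaderUrls));
    return cloud.query(collection, q);
  }
}
{code}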

None of the above will completely avoid this problem, but they will make it 
less likely.

> replicas in LIR seem to still be participating in search requests
> -----------------------------------------------------------------
>
>                 Key: SOLR-9366
>                 URL: https://issues.apache.org/jira/browse/SOLR-9366
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: SOLR-9366.txt.gz
>
>
> Spinning this off from SOLR-9363, where sarowe's jenkins encountered a strange 
> test failure when TestInjection is used, causing replicas to return errors on 
> some requests.
> Reading over the logs it appears that when this happens and the replicas are 
> put into LIR, they then ignore an explicit user-requested commit (ie: 
> {{Ignoring commit while not ACTIVE - state: BUFFERING replay: false}}) but 
> still participate in queries -- apparently before they finish recovery (or at 
> the very least before / without doing the commit/openSearcher that they 
> previously ignored).
> Details and logs to follow...


