[ 
https://issues.apache.org/jira/browse/SOLR-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253612#comment-17253612
 ] 

Ilan Ginzburg commented on SOLR-15052:
--------------------------------------

I agree [~noble.paul]. There's no serialization; what I meant to say is that 
the list of znode children is read for essentially every replica update.

Say we get a sequence of updates for {{R1, R2, R3, R4, R5}}.
Assuming {{R1}} and {{R2}} arrive at the same time, they can be executed at the 
same time as per your example (assuming the {{DocCollection 
getPerReplicaStates()}} is up to date).
If slightly later {{R3}} and {{R4}} arrive together, each is going to see a 
changed {{cversion}} if they haven't re-read the directory listing after the 
{{R1}} & {{R2}} update. Each is going to re-read the directory before executing 
the update (done in [PerReplicaStates.fetch 
L166|https://github.com/apache/lucene-solr/pull/2148/files#diff-0bd8a828302915c525c8df3e8cccdc9881ebad121359c0dbc8374b8b72995669R166]
 called from [ZkController.publish 
L1622|https://github.com/apache/lucene-solr/pull/2148/files#diff-5b63503605ede4384429e74d1fa0c410adc5da8f3246e8c36e49feff2f3ea692R1622]
 before the call to [PerReplicaStates.persist 
L107|https://github.com/apache/lucene-solr/pull/2148/files#diff-0bd8a828302915c525c8df3e8cccdc9881ebad121359c0dbc8374b8b72995669R107]
 doing the actual [multi 
(L136)|https://github.com/apache/lucene-solr/pull/2148/files#diff-0bd8a828302915c525c8df3e8cccdc9881ebad121359c0dbc8374b8b72995669R136]
 operation).
Then the {{R5}} update is also going to read the directory listing and execute.

Basically, each new replica update request triggers a new read of the 
directory listing, unless the {{PerReplicaStates}} stored in {{DocCollection}} 
is already up to date for other reasons and the new update requests arrive at 
exactly the same time. Updates are not serialized ({{R3}} and {{R4}} can 
execute in parallel), but there's some inefficiency in the way they're handled.

I wanted to see the actual impact of this. Based on [~ichattopadhyaya]'s test 
[StateListVsCASSpinlock.java|https://raw.githubusercontent.com/chatman/experiments/main/src/main/java/StateListVsCASSpinlock.java]
 I tried to get an idea of the costs of the different actions. With 500 child 
znodes, {{getChildren}} took about 10-15ms on my laptop, while {{getData}} on 
a single file with an equivalent amount of text took longer, at ~20ms. This 
came as a surprise to me.

The multi operation (delete znode, create znode) took about 40ms, while the CAS 
of the text file was faster at 30ms. As expected, though, the CAS approach 
needed many retries, which slowed the process down considerably: with 500 
replicas, using independent znodes gave a speedup of over 10x compared to a 
single text file with CAS.
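
To give a feel for why the retries dominate, here is a toy worst-case model (not a reproduction of the measurement above, and not code from the test). It assumes all writers run in lockstep: in each round every remaining writer attempts a CAS on the shared file and exactly one wins. With independent znodes it assumes each writer touches only its own znode, so one attempt each. The class and method names are made up for illustration.

```java
public class ContentionModel {

    /**
     * Total CAS attempts when `writers` clients update one shared file in
     * lockstep: every round all remaining writers try, exactly one succeeds.
     */
    static long casAttempts(int writers) {
        long attempts = 0;
        for (int remaining = writers; remaining > 0; remaining--) {
            attempts += remaining; // everyone still pending tries this round
        }
        return attempts;
    }

    /** Independent znodes: each writer updates its own znode, no conflicts. */
    static long perZnodeAttempts(int writers) {
        return writers;
    }

    public static void main(String[] args) {
        int writers = 500;
        System.out.println("shared file CAS attempts: " + casAttempts(writers));
        System.out.println("per-znode attempts:       " + perZnodeAttempts(writers));
    }
}
```

Real clients are not in lockstep, so the actual number of retries is lower than in this model, which is consistent with a measured speedup well below the worst case.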

The implementation in the PR could easily avoid systematically re-reading the 
znode children list by attempting the multi operation on the cached 
{{PerReplicaStates}} of the {{DocCollection}} (if not {{null}}). Only if the 
multi fails should it re-read the directory listing and try again. Maybe not 
worth it at this point though (but something to keep in mind).
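
A minimal sketch of that idea, for illustration only. It uses an in-memory stand-in ({{FakeDirectory}}) rather than the ZooKeeper client, and the child naming {{replica:version:state}}, the {{publish}} helper and the read counter are all assumptions of mine, not the code in the PR.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CachedStatesSketch {

    /** In-memory stand-in for the per-replica state znodes; NOT the ZooKeeper API. */
    static class FakeDirectory {
        private final Set<String> children = new HashSet<>();
        int listReads = 0; // how many directory listings were served

        synchronized List<String> getChildren() {
            listReads++;
            return new ArrayList<>(children);
        }

        /** Atomic "delete oldNode, create newNode"; oldNode may be null for a first create. */
        synchronized boolean multi(String oldNode, String newNode) {
            if (oldNode != null && !children.contains(oldNode)) return false;
            if (children.contains(newNode)) return false;
            if (oldNode != null) children.remove(oldNode);
            children.add(newNode);
            return true;
        }
    }

    /** Finds the child named "replica:version:state" (hypothetical naming), or null. */
    static String find(List<String> listing, String replica) {
        for (String child : listing) {
            if (child.startsWith(replica + ":")) return child;
        }
        return null;
    }

    /** Tries the multi on the cached listing first; re-reads only if that fails. */
    static List<String> publish(FakeDirectory dir, List<String> cached,
                                String replica, String state) {
        List<String> view = cached;
        boolean fresh = false;
        while (true) {
            String oldNode = (view == null) ? null : find(view, replica);
            if (oldNode != null) {
                int version = Integer.parseInt(oldNode.split(":")[1]) + 1;
                String newNode = replica + ":" + version + ":" + state;
                if (dir.multi(oldNode, newNode)) {
                    List<String> updated = new ArrayList<>(view);
                    updated.remove(oldNode);
                    updated.add(newNode);
                    return updated; // new cached view for the caller
                }
            } else if (fresh) {
                throw new IllegalStateException("unknown replica " + replica);
            }
            view = dir.getChildren(); // cache missing or stale: re-read, retry
            fresh = true;
        }
    }

    public static void main(String[] args) {
        FakeDirectory dir = new FakeDirectory();
        for (int i = 1; i <= 5; i++) dir.multi(null, "R" + i + ":0:down");
        List<String> cached = dir.getChildren();
        cached = publish(dir, cached, "R1", "active");
        dir.multi("R2:0:down", "R2:1:active"); // another node updates R2
        cached = publish(dir, cached, "R3", "active");     // cache stale, multi still succeeds
        cached = publish(dir, cached, "R2", "recovering"); // multi fails once, then re-read
        System.out.println("directory listings read: " + dir.listReads);
    }
}
```

In this sketch a stale cache only costs a re-read when the stale entry is the one for the replica being updated; changes to other replicas do not make the multi fail.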

> Reducing overseer bottlenecks using per-replica states
> ------------------------------------------------------
>
>                 Key: SOLR-15052
>                 URL: https://issues.apache.org/jira/browse/SOLR-15052
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: per-replica-states-gcp.pdf
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> This work has the same goal as SOLR-13951, that is to reduce overseer 
> bottlenecks by keeping replica state updates from going to the state.json 
> via the overseer. However, the approach taken here is different from 
> SOLR-13951 and hence this work supersedes that work.
> The design proposed is here: 
> https://docs.google.com/document/d/1xdxpzUNmTZbk0vTMZqfen9R3ArdHokLITdiISBxCFUg/edit
> Briefly,
> # Every replica's state will be in a separate znode nested under the 
> state.json. Its name encodes the replica name, state, and leadership 
> status.
> # An additional children watcher to be set on state.json for state changes.
> # Upon a state change, a ZK multi-op to delete the previous znode and add a 
> new znode with new state.
> Differences between this and SOLR-13951,
> # In SOLR-13951, we planned to leverage shard terms for per shard states.
> # As a consequence, the code changes required for SOLR-13951 were massive (we 
> needed a shard state provider abstraction and introduce it everywhere in the 
> codebase).
> # This approach is a drastically simpler change and design.
> Credit for this design and the PR is due to [~noble.paul]. 
> [~markrmil...@gmail.com], [~noble.paul] and I have collaborated on this 
> effort. The reference branch takes a conceptually similar (but not identical) 
> approach.
> I shall attach a PR and performance benchmarks shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
