[ 
https://issues.apache.org/jira/browse/SOLR-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382834#comment-14382834
 ] 

Yonik Seeley commented on SOLR-6816:
------------------------------------

bq. > If we could initialize the highest versions correctly (would need to get 
from the index) ...
bq. Can you expand on this a bit? I'm not clear how we would get the latest 
version of the bucket from the index?

Let me back up for a minute and more fully describe things for others that may 
be following along.

VersionInfo has a list of 256 version buckets.  We hash the ID field and
synchronize on the bucket to ensure that only one update with a given ID is
being processed at a time.  This is how we know which update "won" and
can replicate that ordering on the replicas, etc.
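
As a rough illustration of the bucket locking described above (this is not 
the actual Solr code; BucketSketch, bucketIndex and applyUpdate are names made 
up for the example, and assigning the version from a counter while holding the 
bucket lock is an assumption of the sketch):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of lock striping by document ID; not Solr's classes.
final class BucketSketch {
    static final int NUM_BUCKETS = 256; // power of two, so masking works

    private final Object[] buckets = new Object[NUM_BUCKETS];
    private final AtomicLong clock = new AtomicLong(); // stand-in version source

    BucketSketch() {
        for (int i = 0; i < NUM_BUCKETS; i++) {
            buckets[i] = new Object();
        }
    }

    // Map an ID to a bucket. Equal IDs always land in the same bucket.
    int bucketIndex(String id) {
        int h = id.hashCode();
        h ^= (h >>> 16);              // mix high bits into the low bits
        return h & (NUM_BUCKETS - 1); // always in [0, NUM_BUCKETS)
    }

    // Apply one update while holding the bucket lock, so two updates to the
    // same ID are never in flight together and their versions are ordered.
    long applyUpdate(String id, Runnable writeToIndex) {
        synchronized (buckets[bucketIndex(id)]) {
            long version = clock.incrementAndGet();
            writeToIndex.run();
            return version;
        }
    }
}
```

Updates to different IDs only contend when they happen to hash to the same 
bucket, which is why a fixed number of buckets is enough.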

On replicas, things can be sent over multiple threads/connections, and updates 
can get reordered.  We detect this by checking the version in the index (and 
tlog) and make sure there is nothing newer.  We do the same synchronization on 
the version bucket here as well.

If we maintain the highest version we've ever seen on the bucket (it's already 
there, VersionBucket.highest), then when an update comes in, we just compare 
the update version to the bucket.highest... if the update version is higher 
(which it almost always will be) then we know that this update wasn't reordered 
and we don't have to do any checking of the actual index. Maintaining 
VersionBucket.highest is simple too... we just update it each time we see a 
larger version.
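
A minimal sketch of that fast path, assuming a hypothetical ReplicaBucket 
class and a caller-supplied lookup that returns the version currently stored 
for the document in the index/tlog (0 if the document is absent). None of 
these names come from Solr itself:

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of the "highest version seen" check on a replica.
final class ReplicaBucket {
    private long highest; // highest version ever seen on this bucket

    synchronized long highest() {
        return highest;
    }

    // Returns true if the update should be applied, false if it is stale.
    // storedVersionLookup is only invoked when the fast path cannot decide.
    synchronized boolean shouldApply(long updateVersion,
                                     LongSupplier storedVersionLookup) {
        if (updateVersion > highest) {
            // Newer than anything this bucket has seen, so it cannot have
            // been reordered behind a newer update. No index check needed.
            highest = updateVersion;
            return true;
        }
        // Possibly reordered (or just a different ID in the same bucket):
        // fall back to checking the version stored for this document.
        long stored = storedVersionLookup.getAsLong();
        return updateVersion > stored;
    }
}
```

The slow path is still needed because a bucket covers many IDs: a version 
lower than the bucket's highest may simply belong to a different document.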

The only problem comes with initialization of "highest".  If we start with a
new index, we're all good.  But if we start up with an existing index... we
don't know what the highest version number in that index is, and hence we don't
know what is safe to initialize VersionBucket.highest to.

If it's a single node or if clocks are well synchronized, then we can just pick
the current time as "highest".  But if this index is replicated from another
node, and the clock skew is more than the time it took to replicate the index,
then it's possible that there is something newer in the index.  It's actually
pretty unlikely I think, but when we were starting off with all this cloud
stuff I didn't want to introduce any more factors.

Anyway, one way to solve the initialization problem is to simply look in the
index and find the highest value for _version_.  Then use that as the initial
value for every bucket (no need to hash the IDs and get it exact).
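
To make that concrete, here is a small sketch. It assumes the _version_ 
values have already been read out of the index into a long[] (how that is 
done against the real index is left out), and HighestInit / seedBuckets are 
hypothetical names used only for this example:

```java
import java.util.Arrays;

// Hypothetical sketch: seed every bucket's "highest" from the index maximum.
final class HighestInit {

    // indexedVersions stands in for the _version_ values found in the index.
    // Returns one initial "highest" value per bucket.
    static long[] seedBuckets(long[] indexedVersions, int numBuckets) {
        long max = 0L; // an empty (new) index starts every bucket at 0
        for (long v : indexedVersions) {
            max = Math.max(max, v);
        }
        long[] highest = new long[numBuckets];
        // The same value is safe for all buckets: it is an upper bound on
        // every version in the index, whichever bucket its ID hashes to.
        Arrays.fill(highest, max);
        return highest;
    }
}
```

Using the global maximum is conservative. Until a bucket sees a version above 
it, updates on that bucket fall back to checking the index, which is slower 
but still correct.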

> Review SolrCloud Indexing Performance.
> --------------------------------------
>
>                 Key: SOLR-6816
>                 URL: https://issues.apache.org/jira/browse/SOLR-6816
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Priority: Critical
>         Attachments: SolrBench.pdf
>
>
> We have never really focused on indexing performance, just correctness and 
> low hanging fruit. We need to vet the performance and try to address any 
> holes.
> Note: A common report is that adding any replication is very slow.



