[ https://issues.apache.org/jira/browse/SOLR-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382834#comment-14382834 ]
Yonik Seeley commented on SOLR-6816:
------------------------------------

bq. > If we could initialize the highest versions correctly (would need to get from the index) ...
bq. Can you expand on this a bit? I'm not clear how we would get the latest version of the bucket from the index?

Let me back up for a minute and more fully describe things for others that may be following along.

VersionInfo has a list of 256 version buckets. We hash the ID field and synchronize on the bucket to ensure that only one update with a given ID is being processed at a time. This is how we know which update "won" and can replicate that ordering on the replicas, etc.

On replicas, things can be sent over multiple threads/connections, and updates can get reordered. We detect this by checking the version in the index (and tlog) and making sure there is nothing newer. We do the same synchronization on the version bucket here as well.

If we maintain the highest version we've ever seen on the bucket (it's already there, VersionBucket.highest), then when an update comes in, we just compare the update version to bucket.highest... if the update version is higher (which it almost always will be) then we know that this update wasn't reordered and we don't have to do any checking of the actual index. Maintaining VersionBucket.highest is simple too... we just update it each time we see a larger version.

The only problem comes with initialization of "highest". If we start with a new index, we're all good. But if we start up with an existing index... we don't know what the highest version number in that index is, and hence we don't know what is safe to initialize VersionBucket.highest to. If it's a single node or if clocks are well synchronized, then we can just pick the current time as "highest". But if this index is replicated from another node, and the clock skew is more than the time it took to replicate the index, then it's possible that there is something newer in the index.
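The fast path described above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not Solr's actual VersionInfo code: the class VersionBucketSketch, the method acceptOnReplica, the slowPathChecks counter, and the indexVersionOf function that stands in for the index/tlog lookup are all hypothetical names, and picking a bucket by masking String.hashCode() is an assumption as well.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.ToLongFunction;

// Hypothetical sketch; none of these names are Solr's real API.
final class VersionBucketSketch {
    private static final int NUM_BUCKETS = 256; // power of two, so masking the hash works

    /** A bucket is both the lock object and the holder of the highest version seen. */
    private static final class Bucket {
        long highest;
        Bucket(long initialHighest) { highest = initialHighest; }
    }

    private final Bucket[] buckets = new Bucket[NUM_BUCKETS];
    // Stand-in for the expensive index/tlog lookup; assumed to return 0 for unknown IDs.
    private final ToLongFunction<String> indexVersionOf;
    // Counts how often the expensive lookup was needed (for demonstration only).
    final AtomicInteger slowPathChecks = new AtomicInteger();

    // initialHighest is only safe if it is >= every version already in the index.
    VersionBucketSketch(long initialHighest, ToLongFunction<String> indexVersionOf) {
        for (int i = 0; i < NUM_BUCKETS; i++) {
            buckets[i] = new Bucket(initialHighest);
        }
        this.indexVersionOf = indexVersionOf;
    }

    /** Returns true if the update should be applied, false if it is stale (reordered). */
    boolean acceptOnReplica(String id, long version) {
        Bucket bucket = buckets[id.hashCode() & (NUM_BUCKETS - 1)];
        synchronized (bucket) {
            if (version > bucket.highest) {
                // Fast path: newer than anything this bucket has seen, so not reordered.
                bucket.highest = version;
                return true;
            }
            // Slow path: possibly reordered, so compare against the stored version.
            slowPathChecks.incrementAndGet();
            return version > indexVersionOf.applyAsLong(id);
        }
    }

    public static void main(String[] args) {
        Map<String, Long> index = new HashMap<>(); // toy stand-in for index + tlog
        VersionBucketSketch versions =
                new VersionBucketSketch(0L, id -> index.getOrDefault(id, 0L));

        System.out.println(versions.acceptOnReplica("doc1", 100L)); // true (fast path)
        index.put("doc1", 100L);
        System.out.println(versions.acceptOnReplica("doc1", 90L));  // false (index check)
        System.out.println(versions.slowPathChecks.get());          // 1
    }
}
```

In this sketch the initial value of "highest" is simply a constructor argument. Passing 0 is safe for a brand-new index; what to pass for an existing index is exactly the open question.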
A newer document in a replicated index is actually pretty unlikely, I think, but when we were starting off with all this cloud stuff I didn't want to introduce any more factors.

Anyway, one way to solve the initialization problem is to simply look in the index and find the highest value for _version_, then use that as the initial value for all of the buckets (no need to hash the IDs and get it exact).

> Review SolrCloud Indexing Performance.
> --------------------------------------
>
> Key: SOLR-6816
> URL: https://issues.apache.org/jira/browse/SOLR-6816
> Project: Solr
> Issue Type: Task
> Components: SolrCloud
> Reporter: Mark Miller
> Priority: Critical
> Attachments: SolrBench.pdf
>
>
> We have never really focused on indexing performance, just correctness and
> low hanging fruit. We need to vet the performance and try to address any
> holes.
> Note: A common report is that adding any replication is very slow.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)