[ https://issues.apache.org/jira/browse/SOLR-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383556#comment-14383556 ]
Per Steffensen commented on SOLR-6816:
--------------------------------------

Just to clarify: we did our own implementation of a bloom filter and did not build on the existing feature, because we did not find it satisfactory. I do not remember exactly why, but it may have been because the current solution builds a bloom filter on each segment, and bloom filters are not really mergeable unless they have the same size and use the same hashing algorithm (and the same number of hashes). To be efficient, you want the bloom filter to be bigger (and potentially to use a different number of hashes) the bigger the segment is.

We decided on a one-bloom-filter-per-index/core/replica (or whatever you like to call it) solution, so that there is no need to merge bloom filters. In addition, you can tell Solr the maximum expected number of documents in the index and the FPP (false-positive probability) you want at that number of documents, and it will calculate the size of the bloom filter to use.

Because such a one-bloom-filter-per-index design does not fit well with Lucene's segmenting strategy, we implemented it as a Solr feature; that is, Solr maintains the bloom filter, and the feature is not available to people who use only Lucene. It is not a simple task to do right, so if the lower-hanging fruit is satisfactory, you should pick it first.

To recap, we have a recently-updated-or-lookedup cache to efficiently find the version of a document that does exist (and was "touched" recently), and a bloom filter to efficiently determine that a document does not exist.

> Review SolrCloud Indexing Performance.
> --------------------------------------
>
>                 Key: SOLR-6816
>                 URL: https://issues.apache.org/jira/browse/SOLR-6816
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Priority: Critical
>         Attachments: SolrBench.pdf
>
>
> We have never really focused on indexing performance, just correctness and low hanging fruit. We need to vet the performance and try to address any holes.
> Note: A common report is that adding any replication is very slow.
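
The sizing step described in the comment (give the maximum expected number of documents and the desired FPP, get back a filter size) can be sketched with the standard bloom-filter formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. This is only an illustration under assumptions: the class and method names below are hypothetical, and nothing here is taken from the implementation the comment refers to or from the Solr or Lucene APIs.

```java
/** Hypothetical helper for illustration; not part of Solr or Lucene. */
final class BloomFilterSizing {

    /** Bits needed for expectedItems entries at false-positive probability fpp. */
    static long optimalNumBits(long expectedItems, double fpp) {
        if (expectedItems <= 0 || fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("need expectedItems > 0 and 0 < fpp < 1");
        }
        double ln2 = Math.log(2.0);
        // m = -n * ln(p) / (ln 2)^2, rounded up to a whole bit
        return (long) Math.ceil(-expectedItems * Math.log(fpp) / (ln2 * ln2));
    }

    /** Number of hash functions that minimizes the FPP for the given size. */
    static int optimalNumHashes(long expectedItems, long numBits) {
        // k = (m / n) * ln 2, at least one hash
        double k = (double) numBits / expectedItems * Math.log(2.0);
        return Math.max(1, (int) Math.round(k));
    }

    public static void main(String[] args) {
        long maxDocs = 10_000_000L; // assumed maximum number of documents in the index
        double fpp = 0.01;          // assumed target false-positive probability
        long bits = optimalNumBits(maxDocs, fpp);
        int hashes = optimalNumHashes(maxDocs, bits);
        System.out.printf("bits=%d (~%.1f MiB), hashes=%d%n",
                bits, bits / 8.0 / (1024 * 1024), hashes);
    }
}
```

Because both the bit count and the hash count depend on the expected number of entries, filters sized separately for segments of different sizes generally end up with different parameters, which is the mergeability problem the comment mentions.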