[ 
https://issues.apache.org/jira/browse/SOLR-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383556#comment-14383556
 ] 

Per Steffensen commented on SOLR-6816:
--------------------------------------

Just to clarify: we did our own Bloom filter implementation rather than 
build on the existing feature, because we did not find it satisfactory. I do 
not remember exactly why, but it may have been because the current solution 
builds a Bloom filter per segment, and Bloom filters are not really mergeable 
unless they have the same size and use the same hashing algorithm (and the 
same number of hashes). To be efficient, you want your Bloom filter to be 
bigger (and potentially to use a different number of hashes) the bigger your 
segment is. We decided on a one-Bloom-filter-per-index/core/replica (or 
whatever you like to call it) solution, so that there is no need to merge 
Bloom filters. Besides that, you can tell Solr the maximum expected number of 
documents in the index and the FPP (false-positive probability) you want at 
that number of documents, and it will calculate the size of the Bloom filter 
to use. Because such a one-Bloom-filter-per-index does not go hand in hand 
with Lucene's segmenting strategy, we implemented it as a Solr feature; that 
is, it is Solr that maintains the Bloom filter, and the feature is not 
available to people who use only Lucene.
It is not a simple task to do right, so if the lower-hanging fruit is 
satisfactory, you should pick that first.
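
For illustration, the size calculation can be sketched with the standard 
Bloom filter formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 
hashes, for n expected documents and target FPP p. This is only a sketch of 
the textbook math, not our actual code; the class and method names are made 
up for the example.

```java
/** Hypothetical sketch of textbook Bloom filter sizing; names are illustrative only. */
final class BloomSizingSketch {

    /** Number of bits m needed for n expected documents at false-positive probability p. */
    static long optimalBits(long expectedDocs, double fpp) {
        if (expectedDocs <= 0 || fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("need expectedDocs > 0 and 0 < fpp < 1");
        }
        double ln2 = Math.log(2);
        // m = -n * ln(p) / (ln 2)^2, rounded up to a whole bit
        return (long) Math.ceil(-expectedDocs * Math.log(fpp) / (ln2 * ln2));
    }

    /** Number of hash functions k that minimizes the FPP for m bits and n documents. */
    static int optimalHashes(long expectedDocs, long numBits) {
        // k = (m / n) * ln 2, at least one hash
        return Math.max(1, (int) Math.round((double) numBits / expectedDocs * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;   // assumed maximum number of documents in the index
        double p = 0.01;       // assumed target FPP at that number of documents
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.println("bits=" + m + " (~" + (m / 8 / 1024) + " KiB), hashes=" + k);
    }
}
```

Because both m and k depend on n, two filters sized for segments of 
different sizes end up with different parameters, which is why they cannot 
simply be OR-ed together at merge time.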
To recap, we have a recently-updated-or-looked-up cache to efficiently find 
the version of a document that does exist (and was "touched" recently), and a 
Bloom filter to efficiently find out that a document does not exist.
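
A minimal sketch of how the two pieces could fit together on a version 
lookup follows. It is an assumption-laden illustration, not our 
implementation: the class, the `indexLookup` function standing in for the 
expensive index lookup, and the simplistic double hashing are all 
hypothetical.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: resolve a document version via cache, then Bloom filter, then index. */
final class VersionLookupSketch {
    private final int numBits;
    private final int numHashes;
    private final BitSet bits;
    private final Map<String, Long> recent;            // LRU cache: id -> version
    private final Function<String, Long> indexLookup;  // assumed slow path; returns null if absent
    int indexLookups = 0;                              // counts how often the slow path is taken

    VersionLookupSketch(int numBits, int numHashes, final int cacheSize,
                        Function<String, Long> indexLookup) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
        this.indexLookup = indexLookup;
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.recent = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** i-th bit position for an id; simplistic double hashing, for illustration only. */
    private int position(String id, int i) {
        int h1 = id.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    private boolean mightContain(String id) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(id, i))) {
                return false; // at least one bit unset: the id was never added
            }
        }
        return true; // possibly present (may be a false positive)
    }

    /** Record that a document was added or updated with the given version. */
    void add(String id, long version) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(id, i));
        }
        recent.put(id, version);
    }

    /** Returns the document's version, or null if it does not exist. */
    Long lookup(String id) {
        Long cached = recent.get(id);
        if (cached != null) {
            return cached;              // recently touched: answered from the cache
        }
        if (!mightContain(id)) {
            return null;                // definitely absent: no index lookup needed
        }
        indexLookups++;                 // fall back to the expensive index lookup
        Long version = indexLookup.apply(id);
        if (version != null) {
            recent.put(id, version);
        }
        return version;
    }
}
```

The cache answers the "exists and was touched recently" case, the Bloom 
filter answers most of the "does not exist" cases, and only the remainder 
falls through to the index.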

> Review SolrCloud Indexing Performance.
> --------------------------------------
>
>                 Key: SOLR-6816
>                 URL: https://issues.apache.org/jira/browse/SOLR-6816
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Priority: Critical
>         Attachments: SolrBench.pdf
>
>
> We have never really focused on indexing performance, just correctness and 
> low hanging fruit. We need to vet the performance and try to address any 
> holes.
> Note: A common report is that adding any replication is very slow.



