[ https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105715#comment-14105715 ]
Toke Eskildsen commented on SOLR-5894: -------------------------------------- Forked Lucene/Solr on GitHub for better project management: https://github.com/tokee/lucene-solr . Experimental branches for sparse faceting are lucene_solr_4_8_sparse and lucene_solr_4_9_sparse and will contain the latest code. The patch above for 4.7.1 is a little behind the GitHub branches, as it does not have speed up for second phase (getting counts for specific terms) of distributed faceting. > Speed up high-cardinality facets with sparse counters > ----------------------------------------------------- > > Key: SOLR-5894 > URL: https://issues.apache.org/jira/browse/SOLR-5894 > Project: Solr > Issue Type: Improvement > Components: SearchComponents - other > Affects Versions: 4.7.1 > Reporter: Toke Eskildsen > Priority: Minor > Labels: faceted-search, faceting, memory, performance > Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, > SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, > SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, > SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, > author_7M_tags_1852_logged_queries_warmed.png, > sparse_2000000docs_fc_cutoff_20140403-145412.png, > sparse_5000000docs_20140331-151918_multi.png, > sparse_5000000docs_20140331-151918_single.png, > sparse_50510000docs_20140328-152807.png > > > Field based faceting in Solr has two phases: Collecting counts for tags in > facets and extracting the requested tags. > The execution time for the collecting phase is approximately linear to the > number of hits and the number of references from hits to tags. This phase is > not the focus here. > The extraction time scales with the number of unique tags in the search > result, but is also heavily influenced by the total number of unique tags in > the facet as every counter, 0 or not, is visited by the extractor (at least > for count order). For fields with millions of unique tag values this means > 10s of milliseconds added to the minimum response time (see > https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ > for a test on a corpus with 7M unique values in the facet). > The extractor needs to visit every counter due to the current counter > structure being a plain int-array of size #unique_tags. Switching to a sparse > structure, where only the tag counters > 0 are visited, makes the extraction > time linear to the number of unique tags in the result set. > Unfortunately the number of unique tags in the result set is unknown at > collect time, so it is not possible to reliably select sparse counting vs. > full counting up front. Luckily there exists solutions for sparse sets that > has the property of switching to non-sparse-mode without a switch-penalty, > when the sparse-threshold is exceeded (see > http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This > JIRA aims to implement this functionality in Solr. > Current status: Sparse counting is implemented for field cache faceting, both > single- and multi-value, with and without doc-values. Sort by count only. The > patch applies cleanly to Solr 4.6.1 and should integrate well with everything > as all functionality is unchanged. After patching, the following new > parameters are possible: > * facet.sparse=true enables sparse faceting. > * facet.sparse.mintags=10000 the minimum amount of unique tags in the given > field for sparse faceting to be active. This is used for auto-selecting > whether sparse should be used or not. > * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. > Setting this too low means that only very small result sets are handled as > sparse. Setting this too high will result in a large performance penalty if > the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to > work well. > * facet.sparse.packed=true use PackecInts for counters instead of int[]. This > saves memory, but performance will differ. Whether performance will be better > or worse depends on the corpus. Experiment with it. > * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of > unique tags in the search result exceeds this fraction of the sparse tracker, > do not perform sparse tracking. The estimate is based on the assumption that > references from documents to tags are distributed randomly. > * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and > keep in memory, ready for usage. Clearing and re-using a counter is faster > that allocating it fresh from the heap. Setting the pool size to 0 means than > a new sparse counter will be allocated each time, just as standard Solr > faceting works. > * facet.sparse.stats=true adds a special tag with timing statistics for > sparse faceting. > * facet.sparse.stats.reset=true resets the timing statistics and clears the > pool. > The parameters needs to be given together with standard faceting parameters, > such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The > defaults should be usable, so simply appending facet.sparse=true to the URL > is a good start. -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org