[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105715#comment-14105715
 ] 

Toke Eskildsen commented on SOLR-5894:
--------------------------------------

Forked Lucene/Solr on GitHub for better project management: 
https://github.com/tokee/lucene-solr . Experimental branches for sparse 
faceting are lucene_solr_4_8_sparse and lucene_solr_4_9_sparse and will contain 
the latest code. The patch above for 4.7.1 is a little behind the GitHub 
branches, as it does not have speed up for second phase (getting counts for 
specific terms) of distributed faceting.

> Speed up high-cardinality facets with sparse counters
> -----------------------------------------------------
>
>                 Key: SOLR-5894
>                 URL: https://issues.apache.org/jira/browse/SOLR-5894
>             Project: Solr
>          Issue Type: Improvement
>          Components: SearchComponents - other
>    Affects Versions: 4.7.1
>            Reporter: Toke Eskildsen
>            Priority: Minor
>              Labels: faceted-search, faceting, memory, performance
>         Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_2000000docs_fc_cutoff_20140403-145412.png, 
> sparse_5000000docs_20140331-151918_multi.png, 
> sparse_5000000docs_20140331-151918_single.png, 
> sparse_50510000docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=10000 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
> saves memory, but performance will differ. Whether performance will be better 
> or worse depends on the corpus. Experiment with it.
> * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
> unique tags in the search result exceeds this fraction of the sparse tracker, 
> do not perform sparse tracking. The estimate is based on the assumption that 
> references from documents to tags are distributed randomly.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for 
> sparse faceting.
> * facet.sparse.stats.reset=true resets the timing statistics and clears the 
> pool.
> The parameters needs to be given together with standard faceting parameters, 
> such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
> defaults should be usable, so simply appending facet.sparse=true to the URL 
> is a good start.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to