[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
---------------------------------

    Description: 
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions for sparse sets that has the 
property of switching to non-sparse-mode without a switch-penalty, when the 
sparse-threshold is exceeded (see 
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA 
aims to implement this functionality in Solr.

Current status: Sparse counting is implemented for field cache faceting, both 
single- and multi-value, with and without doc-values. Sort by count only. The 
patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
as all functionality is unchanged. After patching, the following new parameters 
are possible:

* facet.sparse=true enables sparse faceting.
* facet.sparse.mintags=10000 the minimum amount of unique tags in the given 
field for sparse faceting to be active. This is used for auto-selecting whether 
sparse should be used or not.
* facet.sparse.fraction=0.08 the overhead used for the sparse tracker. Setting 
this too low means that only very small result sets are handled as sparse. 
Setting this too high will result in a large performance penalty if the result 
set blows the sparse tracker. Values between 0.04 and 0.1 seems to work well.
* facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
keep in memory, ready for usage. Clearing and re-using a counter is faster that 
allocating it fresh from the heap. Setting the pool size to 0 means than a new 
sparse counter will be allocated each time, just as standard Solr faceting 
works.

* facet.sparse.stats=true adds a special tag with timing statistics for sparse 
faceting.
* facet.sparse.stats.reset=true resets the timing statistics and clears the 
pool.

The parameters needs to be given together with standard faceting parameters, 
such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
defaults should be usable, so simply appending facet.sparse=true to the URL is 
a good start.

  was:
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions for sparse sets that has the 
property of switching to non-sparse-mode without a switch-penalty, when the 
sparse-threshold is exceeded (see 
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA 
aims to implement this functionality in Solr.

Current status: Sparse counting is implemented for field cache faceting, both 
single- and multi-value, with and without doc-values. The patch applies cleanly 
to Solr 4.6.1 and should integrate well with everything as all functionality is 
unchanged. After patching, the following new parameters are possible:




> Speed up high-cardinality facets with sparse counters
> -----------------------------------------------------
>
>                 Key: SOLR-5894
>                 URL: https://issues.apache.org/jira/browse/SOLR-5894
>             Project: Solr
>          Issue Type: Improvement
>          Components: SearchComponents - other
>    Affects Versions: 4.6.1, 4.7
>            Reporter: Toke Eskildsen
>            Priority: Minor
>             Fix For: 4.6.1
>
>         Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_5000000docs_20140331-151918_multi.png, 
> sparse_5000000docs_20140331-151918_single.png, 
> sparse_50510000docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=10000 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for 
> sparse faceting.
> * facet.sparse.stats.reset=true resets the timing statistics and clears the 
> pool.
> The parameters needs to be given together with standard faceting parameters, 
> such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
> defaults should be usable, so simply appending facet.sparse=true to the URL 
> is a good start.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to