[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
---------------------------------

    Attachment: SOLR-5894.patch

Simple proof of concept patch for sparse counting for field-based faceting on 
high-cardinality multi-value fields.

Apply patch, build Solr, open an index with a proper high-cardinality field (> 
100.000 unique values) such as author or title. Standard field based faceting 
works as always.

Activate sparse counting by appending facet.sparse=true to the request. Control 
the amount of extra space used for the sparse structure with 
facet.sparse.fraction (default is 0.025, which seems to work well).

Tested against Solr 4.6.1. Not tested against 4.7.

> Speed up high-cardinality facets with sparse counters
> -----------------------------------------------------
>
>                 Key: SOLR-5894
>                 URL: https://issues.apache.org/jira/browse/SOLR-5894
>             Project: Solr
>          Issue Type: Improvement
>          Components: SearchComponents - other
>    Affects Versions: 4.6.1, 4.7
>            Reporter: Toke Eskildsen
>            Priority: Minor
>             Fix For: 4.6.1
>
>         Attachments: SOLR-5894.patch
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr (a proof of concept patch 
> will be provided shortly).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to