[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

Toke Eskildsen (JIRA) Fri, 28 Mar 2014 08:06:30 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Toke Eskildsen updated SOLR-5894:
---------------------------------

    Attachment: SOLR-5894_test.zip
                sparse_50510000docs_20140328-152807.png
                SOLR-5894.patch

Patch with several bugfixes and updated test scripts.

The chart sparse_50510000docs_20140328-152807.png shows results form a test run 
on 50M documents / 150M unique tags. The black line is standard Solr, the red 
is a sparse counter with 100% overhead, the cyan is an attempt of compromising 
with 8% sparse overhead.

The performance hit for passing the 8% cut-off is huge. One way to avoid it 
could be to make a numhits-based guess of the expected number of unique tags, 
when choosing facet implementation. The consequence of a wrong guess would 
"just" be poorer performance, with the worst-case being the max of the black 
and the cyan line for any given number of unique tags in the search result.

> Speed up high-cardinality facets with sparse counters
> -----------------------------------------------------
>
>                 Key: SOLR-5894
>                 URL: https://issues.apache.org/jira/browse/SOLR-5894
>             Project: Solr
>          Issue Type: Improvement
>          Components: SearchComponents - other
>    Affects Versions: 4.6.1, 4.7
>            Reporter: Toke Eskildsen
>            Priority: Minor
>             Fix For: 4.6.1
>
>         Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_50510000docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr (a proof of concept patch 
> will be provided shortly).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

Reply via email to