[ https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116786#comment-14116786 ]
Toke Eskildsen commented on SOLR-5894: -------------------------------------- Status update The code seems to be in working state for Solr 4.8 and 4.9 for Field Cache faceting (single- and multi-value) and DocValue faceting (single- and multi-value) for Strings. It really needs testing by someone else than me, so that the validity of the responses and the speed upc can be verified. I am willing to port this to trunk if a committer is willing to work on getting the patch into Solr, but until then I will stick to stable versions as that is what we use locally. Sparse faceting introduces some architectural changes that needs to be addressed. 1) The core sparse counting itself is surprisingly non-invasive. It could be a standalone patch. However, this only works really well in a non-distributed setting and has less effect in a distributed one. 2) Avoiding re-allocation of counter structures by maintaining a pool of structures lowers the minimum faceting time and lowers the need for GC. It is lowered quite a lot, I might add, as faceting is normally one of the big GC-triggers, so I would strongly recommend this feature. Such a pool is very much like a cache and as such must play nice with index updates and other caching structures in Solr. The current state of the counter pool is that it works well, but is implemented independently of the usual caches. It is probably better to latch on to the standard caching mechanisms in Solr, but I have no experience with that part of the code base. The code for the pool itself is fairly simple and can be implemented as an experimental feature by setting the default pool size to 0. 3) Fast distributed faceting is about speeding up the secondary phase of distrubuted faceting: The fine count for specific terms. This is achieved by replacing the normal search-for-each-requested-term approach by a full faceted count approach, where the counts for all terms are calculated but only the counts for the specified terms are returned. This requires either a separation of counting and extraction in stock Solr code or an extension to the count-extract code block so that is can handle both the case where top-X terms are wanted and where specific terms are requested. Currently the sparse code use the latter approach, but the former is a much cleaner architecture. Both solutions does require some re-factoring and are a bit tricky to get right. 4) A further speed up for distributed faceting can be achieved by upgrading the sparse pools to real caches, so that they contain a mix of empty counters (to avoid GC) and filled counters (caching of counts). As the counting part of the second phase is always exactly the same as for the first phase, cache hit-rate should be very high with distribution. This is not currently implemented and is a bit tricky to get to work well. I recommend that this improvement is postponed, in order to keep the current patch manageable. The sparse principle would also work for Field Cache per Segment faceting of Strings, but no code has been written for this yet. I recommend that this is also postponed, to get things moving. > Speed up high-cardinality facets with sparse counters > ----------------------------------------------------- > > Key: SOLR-5894 > URL: https://issues.apache.org/jira/browse/SOLR-5894 > Project: Solr > Issue Type: Improvement > Components: SearchComponents - other > Affects Versions: 4.7.1 > Reporter: Toke Eskildsen > Priority: Minor > Labels: faceted-search, faceting, memory, performance > Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, > SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, > SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, > SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, > author_7M_tags_1852_logged_queries_warmed.png, > sparse_2000000docs_fc_cutoff_20140403-145412.png, > sparse_5000000docs_20140331-151918_multi.png, > sparse_5000000docs_20140331-151918_single.png, > sparse_50510000docs_20140328-152807.png > > > Field based faceting in Solr has two phases: Collecting counts for tags in > facets and extracting the requested tags. > The execution time for the collecting phase is approximately linear to the > number of hits and the number of references from hits to tags. This phase is > not the focus here. > The extraction time scales with the number of unique tags in the search > result, but is also heavily influenced by the total number of unique tags in > the facet as every counter, 0 or not, is visited by the extractor (at least > for count order). For fields with millions of unique tag values this means > 10s of milliseconds added to the minimum response time (see > https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ > for a test on a corpus with 7M unique values in the facet). > The extractor needs to visit every counter due to the current counter > structure being a plain int-array of size #unique_tags. Switching to a sparse > structure, where only the tag counters > 0 are visited, makes the extraction > time linear to the number of unique tags in the result set. > Unfortunately the number of unique tags in the result set is unknown at > collect time, so it is not possible to reliably select sparse counting vs. > full counting up front. Luckily there exists solutions for sparse sets that > has the property of switching to non-sparse-mode without a switch-penalty, > when the sparse-threshold is exceeded (see > http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This > JIRA aims to implement this functionality in Solr. > Current status: Sparse counting is implemented for field cache faceting, both > single- and multi-value, with and without doc-values. Sort by count only. The > patch applies cleanly to Solr 4.6.1 and should integrate well with everything > as all functionality is unchanged. After patching, the following new > parameters are possible: > * facet.sparse=true enables sparse faceting. > * facet.sparse.mintags=10000 the minimum amount of unique tags in the given > field for sparse faceting to be active. This is used for auto-selecting > whether sparse should be used or not. > * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. > Setting this too low means that only very small result sets are handled as > sparse. Setting this too high will result in a large performance penalty if > the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to > work well. > * facet.sparse.packed=true use PackecInts for counters instead of int[]. This > saves memory, but performance will differ. Whether performance will be better > or worse depends on the corpus. Experiment with it. > * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of > unique tags in the search result exceeds this fraction of the sparse tracker, > do not perform sparse tracking. The estimate is based on the assumption that > references from documents to tags are distributed randomly. > * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and > keep in memory, ready for usage. Clearing and re-using a counter is faster > that allocating it fresh from the heap. Setting the pool size to 0 means than > a new sparse counter will be allocated each time, just as standard Solr > faceting works. > * facet.sparse.stats=true adds a special tag with timing statistics for > sparse faceting. > * facet.sparse.stats.reset=true resets the timing statistics and clears the > pool. > The parameters needs to be given together with standard faceting parameters, > such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The > defaults should be usable, so simply appending facet.sparse=true to the URL > is a good start. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org