[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2015-07-30 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-
Description: 
Multiple performance enhancements to Solr String faceting.

* Sparse counters, switching the constant time overhead of extracting top-X 
terms with time overhead linear to result set size
* Counter re-use for reduced garbage collection and lower per-call overhead
* Optional counter packing, trading speed for space
* Improved distribution count logic, greatly improving the performance of 
distributed faceting
* In-segment threaded faceting
* Regexp based white- and black-listing of facet terms
* Heuristic faceting for large result sets

Currently implemented for Solr 4.10. Source, detailed description and directly 
usable WAR at http://tokee.github.io/lucene-solr/

This project has grown beyond a simple patch and will require a fair amount of 
co-operation with a committer to get into Solr. Splitting into smaller issues 
is a possibility.

  was:
Multiple performance enhancements to Solr String faceting.

* Sparse counters, switching the constant time overhead of extracting top-X 
terms with time linear to result set size
* Counter re-use for reduced garbage collection and lower per-call overhead
* Optional counter packing, trading speed for space
* Improved distribution count logic, greatly reducing the overhead of 
distributed faceting
* In-segment threaded faceting
* Regexp based white- and black-listing of facet terms
* Heuristic faceting for large result sets

Currently implemented for Solr 4.10 and available as source and directly usable 
WAR at http://tokee.github.io/lucene-solr/




> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Multiple performance enhancements to Solr String faceting.
> * Sparse counters, switching the constant time overhead of extracting top-X 
> terms with time overhead linear to result set size
> * Counter re-use for reduced garbage collection and lower per-call overhead
> * Optional counter packing, trading speed for space
> * Improved distribution count logic, greatly improving the performance of 
> distributed faceting
> * In-segment threaded faceting
> * Regexp based white- and black-listing of facet terms
> * Heuristic faceting for large result sets
> Currently implemented for Solr 4.10. Source, detailed description and 
> directly usable WAR at http://tokee.github.io/lucene-solr/
> This project has grown beyond a simple patch and will require a fair amount 
> of co-operation with a committer to get into Solr. Splitting into smaller 
> issues is a possibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2015-07-30 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-
Description: 
Multiple performance enhancements to Solr String faceting.

* Sparse counters, switching the constant time overhead of extracting top-X 
terms with time linear to result set size
* Counter re-use for reduced garbage collection and lower per-call overhead
* Optional counter packing, trading speed for space
* Improved distribution count logic, greatly reducing the overhead of 
distributed faceting
* In-segment threaded faceting
* Regexp based white- and black-listing of facet terms
* Heuristic faceting for large result sets

Currently implemented for Solr 4.10 and available as source and directly usable 
WAR at http://tokee.github.io/lucene-solr/



  was:
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions for sparse sets that has the 
property of switching to non-sparse-mode without a switch-penalty, when the 
sparse-threshold is exceeded (see 
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA 
aims to implement this functionality in Solr.

Current status: Sparse counting is implemented for field cache faceting, both 
single- and multi-value, with and without doc-values. Sort by count only. The 
patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
as all functionality is unchanged. After patching, the following new parameters 
are possible:

* facet.sparse=true enables sparse faceting.
* facet.sparse.mintags=1 the minimum amount of unique tags in the given 
field for sparse faceting to be active. This is used for auto-selecting whether 
sparse should be used or not.
* facet.sparse.fraction=0.08 the overhead used for the sparse tracker. Setting 
this too low means that only very small result sets are handled as sparse. 
Setting this too high will result in a large performance penalty if the result 
set blows the sparse tracker. Values between 0.04 and 0.1 seems to work well.
* facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
saves memory, but performance will differ. Whether performance will be better 
or worse depends on the corpus. Experiment with it.
* facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
unique tags in the search result exceeds this fraction of the sparse tracker, 
do not perform sparse tracking. The estimate is based on the assumption that 
references from documents to tags are distributed randomly.
* facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
keep in memory, ready for usage. Clearing and re-using a counter is faster that 
allocating it fresh from the heap. Setting the pool size to 0 means than a new 
sparse counter will be allocated each time, just as standard Solr faceting 
works.

* facet.sparse.stats=true adds a special tag with timing statistics for sparse 
faceting.
* facet.sparse.stats.reset=true resets the timing statistics and clears the 
pool.

The parameters needs to be given together with standard faceting parameters, 
such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
defaults should be usable, so simply appending facet.sparse=true to the URL is 
a good start.


> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
> 

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-07-03 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: SOLR-5894.patch

Changed the caching of counters to be field-specific, which works a lot better 
for multi-field faceting requests. Introduced facet.sparse.maxtracked to limit 
the maximum count for any facet value: This limits memory consumption at the 
cost of accuracy.

Patch is for Solr 4.7.1.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
> saves memory, but performance will differ. Whether performance will be better 
> or worse depends on the corpus. Experiment with it.
> * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
> unique tags in the search result exceeds this fraction of the sparse tracker, 
> do not perform sparse tracking. The estimate is based on the assumption that 
> references from documents to tags are distributed randomly.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for 
> sparse faceting.
> *

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-06-18 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: SOLR-5894.patch

Changed mincount in distributed faceting to not force mincount=0, resulting in 
sparse faceting also bringing speedup to distributed faceting.

Added support for maxTracked, optionally setting a ceiling for the count for 
any term in a given facet, thereby lowering memory requirements and increasing 
speed, at the cost of accuracy.

Still on Solr 4.7.1.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
> saves memory, but performance will differ. Whether performance will be better 
> or worse depends on the corpus. Experiment with it.
> * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
> unique tags in the search result exceeds this fraction of the sparse tracker, 
> do not perform sparse tracking. The estimate is based on the assumption that 
> references from documents to tags are distributed randomly.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a speci

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-04-17 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Labels: faceted-search faceting memory performance  (was: )

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
> saves memory, but performance will differ. Whether performance will be better 
> or worse depends on the corpus. Experiment with it.
> * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
> unique tags in the search result exceeds this fraction of the sparse tracker, 
> do not perform sparse tracking. The estimate is based on the assumption that 
> references from documents to tags are distributed randomly.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for 
> sparse faceting.
> * facet.sparse.stats.reset=true resets the timing statistics and clears the 
> pool.
> The parameters needs to be given together with standard faceting parameters, 
> such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
> defaults should be usable, so simply ap

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-04-17 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Fix Version/s: (was: 4.7.1)

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
> saves memory, but performance will differ. Whether performance will be better 
> or worse depends on the corpus. Experiment with it.
> * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
> unique tags in the search result exceeds this fraction of the sparse tracker, 
> do not perform sparse tracking. The estimate is based on the assumption that 
> references from documents to tags are distributed randomly.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for 
> sparse faceting.
> * facet.sparse.stats.reset=true resets the timing statistics and clears the 
> pool.
> The parameters needs to be given together with standard faceting parameters, 
> such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
> defaults should be usable, so simply appending facet.sparse=true to

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-04-17 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Fix Version/s: (was: 4.6.1)
   4.7.1

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
> saves memory, but performance will differ. Whether performance will be better 
> or worse depends on the corpus. Experiment with it.
> * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
> unique tags in the search result exceeds this fraction of the sparse tracker, 
> do not perform sparse tracking. The estimate is based on the assumption that 
> references from documents to tags are distributed randomly.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for 
> sparse faceting.
> * facet.sparse.stats.reset=true resets the timing statistics and clears the 
> pool.
> The parameters needs to be given together with standard faceting parameters, 
> such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
> defaults should be usable, so simply appen

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-04-17 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Affects Version/s: (was: 4.6.1)
   (was: 4.7)
   4.7.1

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
> saves memory, but performance will differ. Whether performance will be better 
> or worse depends on the corpus. Experiment with it.
> * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
> unique tags in the search result exceeds this fraction of the sparse tracker, 
> do not perform sparse tracking. The estimate is based on the assumption that 
> references from documents to tags are distributed randomly.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for 
> sparse faceting.
> * facet.sparse.stats.reset=true resets the timing statistics and clears the 
> pool.
> The parameters needs to be given together with standard faceting parameters, 
> such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-04-11 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Description: 
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions for sparse sets that has the 
property of switching to non-sparse-mode without a switch-penalty, when the 
sparse-threshold is exceeded (see 
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA 
aims to implement this functionality in Solr.

Current status: Sparse counting is implemented for field cache faceting, both 
single- and multi-value, with and without doc-values. Sort by count only. The 
patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
as all functionality is unchanged. After patching, the following new parameters 
are possible:

* facet.sparse=true enables sparse faceting.
* facet.sparse.mintags=1 the minimum amount of unique tags in the given 
field for sparse faceting to be active. This is used for auto-selecting whether 
sparse should be used or not.
* facet.sparse.fraction=0.08 the overhead used for the sparse tracker. Setting 
this too low means that only very small result sets are handled as sparse. 
Setting this too high will result in a large performance penalty if the result 
set blows the sparse tracker. Values between 0.04 and 0.1 seems to work well.
* facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
saves memory, but performance will differ. Whether performance will be better 
or worse depends on the corpus. Experiment with it.
* facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
unique tags in the search result exceeds this fraction of the sparse tracker, 
do not perform sparse tracking. The estimate is based on the assumption that 
references from documents to tags are distributed randomly.
* facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
keep in memory, ready for usage. Clearing and re-using a counter is faster that 
allocating it fresh from the heap. Setting the pool size to 0 means than a new 
sparse counter will be allocated each time, just as standard Solr faceting 
works.

* facet.sparse.stats=true adds a special tag with timing statistics for sparse 
faceting.
* facet.sparse.stats.reset=true resets the timing statistics and clears the 
pool.

The parameters needs to be given together with standard faceting parameters, 
such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
defaults should be usable, so simply appending facet.sparse=true to the URL is 
a good start.

  was:
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-04-11 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: SOLR-5894_test.zip
SOLR-5894.patch

Updated patch to solr 4.7.1 and added optional use of PackedInts for counters 
(saves memory but affects performance).

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.6.1, 4.7
>Reporter: Toke Eskildsen
>Priority: Minor
> Fix For: 4.6.1
>
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
> unique tags in the search result exceeds this fraction of the sparse tracker, 
> do not perform sparse tracking. The estimate is based on the assumption that 
> references from documents to tags are distributed randomly.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for 
> sparse faceting.
> * facet.sparse.stats.reset=true resets the timing statistics and clears the 
> pool.
> The parameters needs to be given together with standard faceting parameters, 
> such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
> defaults should be usable, so simply appending facet.sparse=true to the URL 
> is a good start.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

--

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-04-03 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Description: 
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions for sparse sets that has the 
property of switching to non-sparse-mode without a switch-penalty, when the 
sparse-threshold is exceeded (see 
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA 
aims to implement this functionality in Solr.

Current status: Sparse counting is implemented for field cache faceting, both 
single- and multi-value, with and without doc-values. Sort by count only. The 
patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
as all functionality is unchanged. After patching, the following new parameters 
are possible:

* facet.sparse=true enables sparse faceting.
* facet.sparse.mintags=1 the minimum amount of unique tags in the given 
field for sparse faceting to be active. This is used for auto-selecting whether 
sparse should be used or not.
* facet.sparse.fraction=0.08 the overhead used for the sparse tracker. Setting 
this too low means that only very small result sets are handled as sparse. 
Setting this too high will result in a large performance penalty if the result 
set blows the sparse tracker. Values between 0.04 and 0.1 seems to work well.
* facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
unique tags in the search result exceeds this fraction of the sparse tracker, 
do not perform sparse tracking. The estimate is based on the assumption that 
references from documents to tags are distributed randomly.
* facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
keep in memory, ready for usage. Clearing and re-using a counter is faster that 
allocating it fresh from the heap. Setting the pool size to 0 means than a new 
sparse counter will be allocated each time, just as standard Solr faceting 
works.

* facet.sparse.stats=true adds a special tag with timing statistics for sparse 
faceting.
* facet.sparse.stats.reset=true resets the timing statistics and clears the 
pool.

The parameters needs to be given together with standard faceting parameters, 
such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
defaults should be usable, so simply appending facet.sparse=true to the URL is 
a good start.

  was:
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-04-03 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: sparse_200docs_fc_cutoff_20140403-145412.png
SOLR-5894_test.zip
SOLR-5894.patch

Updated patch to support adjustable heuristic guessing of result size and 
selection of implementation based on the guess. Conservative guesses means 
early fallback to standard Solr faceting speed, overly optimistic guesses means 
a performance penalty on wrong guess. Correct guesses means that faceting with 
sparse counting is either faster or the same speed as standard Solr faceting. 
See attached graph sparse_200docs_fc_cutoff_20140403-145412.png for that 
scenario.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.6.1, 4.7
>Reporter: Toke Eskildsen
>Priority: Minor
> Fix For: 4.6.1
>
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr.
> Current status: Sparse counting is implemented for field cache faceting, both 
> single- and multi-value, with and without doc-values. Sort by count only. The 
> patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
> as all functionality is unchanged. After patching, the following new 
> parameters are possible:
> * facet.sparse=true enables sparse faceting.
> * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
> field for sparse faceting to be active. This is used for auto-selecting 
> whether sparse should be used or not.
> * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
> Setting this too low means that only very small result sets are handled as 
> sparse. Setting this too high will result in a large performance penalty if 
> the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
> work well.
> * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
> keep in memory, ready for usage. Clearing and re-using a counter is faster 
> that allocating it fresh from the heap. Setting the pool size to 0 means than 
> a new sparse counter will be allocated each time, just as standard Solr 
> faceting works.
> * facet.sparse.stats=true adds a special tag with timing statistics for 
> sparse faceting.
> * facet.sparse.stats.reset=true resets the timing statistics and clears the 
> pool.
> The parameters needs to be given together with standard faceting parameters, 
> such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
> defaults should be usable, so simply appending facet.sparse=true to the URL 
> i

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-03-31 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Description: 
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions for sparse sets that has the 
property of switching to non-sparse-mode without a switch-penalty, when the 
sparse-threshold is exceeded (see 
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA 
aims to implement this functionality in Solr.

Current status: Sparse counting is implemented for field cache faceting, both 
single- and multi-value, with and without doc-values. Sort by count only. The 
patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
as all functionality is unchanged. After patching, the following new parameters 
are possible:

* facet.sparse=true enables sparse faceting.
* facet.sparse.mintags=1 the minimum amount of unique tags in the given 
field for sparse faceting to be active. This is used for auto-selecting whether 
sparse should be used or not.
* facet.sparse.fraction=0.08 the overhead used for the sparse tracker. Setting 
this too low means that only very small result sets are handled as sparse. 
Setting this too high will result in a large performance penalty if the result 
set blows the sparse tracker. Values between 0.04 and 0.1 seems to work well.
* facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
keep in memory, ready for usage. Clearing and re-using a counter is faster that 
allocating it fresh from the heap. Setting the pool size to 0 means than a new 
sparse counter will be allocated each time, just as standard Solr faceting 
works.

* facet.sparse.stats=true adds a special tag with timing statistics for sparse 
faceting.
* facet.sparse.stats.reset=true resets the timing statistics and clears the 
pool.

The parameters needs to be given together with standard faceting parameters, 
such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The 
defaults should be usable, so simply appending facet.sparse=true to the URL is 
a good start.

  was:
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions for sparse sets that has the 
property of switching to non-sparse-mode without a switch-penalty, when the 
sparse-threshold is exceeded (see 
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA 
aims to implement this functionality in Solr.

Current status: Spars

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-03-31 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Description: 
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions for sparse sets that has the 
property of switching to non-sparse-mode without a switch-penalty, when the 
sparse-threshold is exceeded (see 
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA 
aims to implement this functionality in Solr.

Current status: Sparse counting is implemented for field cache faceting, both 
single- and multi-value, with and without doc-values. The patch applies cleanly 
to Solr 4.6.1 and should integrate well with everything as all functionality is 
unchanged. After patching, the following new parameters are possible:



  was:
Field based faceting in Solr has two phases: Collecting counts for tags in 
facets and extracting the requested tags.

The execution time for the collecting phase is approximately linear to the 
number of hits and the number of references from hits to tags. This phase is 
not the focus here.

The extraction time scales with the number of unique tags in the search result, 
but is also heavily influenced by the total number of unique tags in the facet 
as every counter, 0 or not, is visited by the extractor (at least for count 
order). For fields with millions of unique tag values this means 10s of 
milliseconds added to the minimum response time (see 
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ 
for a test on a corpus with 7M unique values in the facet).

The extractor needs to visit every counter due to the current counter structure 
being a plain int-array of size #unique_tags. Switching to a sparse structure, 
where only the tag counters > 0 are visited, makes the extraction time linear 
to the number of unique tags in the result set.

Unfortunately the number of unique tags in the result set is unknown at collect 
time, so it is not possible to reliably select sparse counting vs. full 
counting up front. Luckily there exists solutions for sparse sets that has the 
property of switching to non-sparse-mode without a switch-penalty, when the 
sparse-threshold is exceeded (see 
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA 
aims to implement this functionality in Solr (a proof of concept patch will be 
provided shortly).


> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.6.1, 4.7
>Reporter: Toke Eskildsen
>Priority: Minor
> Fix For: 4.6.1
>
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags i

[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-03-31 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: sparse_500docs_20140331-151918_single.png
sparse_500docs_20140331-151918_multi.png
SOLR-5894_test.zip
SOLR-5894.patch

Patch-update: Sparse counting is now possible for field cache, with and without 
DocValues, single- as well as multi-field. Segment based field cache faceting 
support is not implemented yet. 

See the attached graphs sparse_500docs_20140331-151918_multi.png and 
sparse_500docs_20140331-151918_single.png for experiments with 5M values, 
multi-value (3 values/doc) and single-value. Note that the y-axis is now 
logarithmic. It seems DocValues benefits nearly as much from sparse counters as 
non-DocValues.

Also note that these measurements are from an artificially large sparse tracker 
(100% overhead) and as such are not representative for a realistic setup.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.6.1, 4.7
>Reporter: Toke Eskildsen
>Priority: Minor
> Fix For: 4.6.1
>
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr (a proof of concept patch 
> will be provided shortly).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-03-28 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: SOLR-5894_test.zip
sparse_5051docs_20140328-152807.png
SOLR-5894.patch

Patch with several bugfixes and updated test scripts.

The chart sparse_5051docs_20140328-152807.png shows results form a test run 
on 50M documents / 150M unique tags. The black line is standard Solr, the red 
is a sparse counter with 100% overhead, the cyan is an attempt of compromising 
with 8% sparse overhead.

The performance hit for passing the 8% cut-off is huge. One way to avoid it 
could be to make a numhits-based guess of the expected number of unique tags, 
when choosing facet implementation. The consequence of a wrong guess would 
"just" be poorer performance, with the worst-case being the max of the black 
and the cyan line for any given number of unique tags in the search result.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.6.1, 4.7
>Reporter: Toke Eskildsen
>Priority: Minor
> Fix For: 4.6.1
>
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr (a proof of concept patch 
> will be provided shortly).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-03-28 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: SOLR-5894_test.zip
SOLR-5894.patch

Bugfix and further instrumentation of the sparse implementation. The attached 
ZIP contains scripts for artificial testing of performance (requires bash, curl 
& gnuplot).

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.6.1, 4.7
>Reporter: Toke Eskildsen
>Priority: Minor
> Fix For: 4.6.1
>
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894_test.zip, author_7M_tags_1852_logged_queries_warmed.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr (a proof of concept patch 
> will be provided shortly).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-03-27 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: SOLR-5894.patch

Slightly updated patch where statistics are returned as part of the facet 
result if facet.sparse.stats=true. Scripts for easy testing of performance will 
follow later.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.6.1, 4.7
>Reporter: Toke Eskildsen
>Priority: Minor
> Fix For: 4.6.1
>
> Attachments: SOLR-5894.patch, SOLR-5894.patch, 
> author_7M_tags_1852_logged_queries_warmed.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr (a proof of concept patch 
> will be provided shortly).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-03-21 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: author_7M_tags_1852_logged_queries_warmed.png

Results from a small scale test of 2 threads sending logged queries against our 
~50GB, 11M document library index, faceting on 'author' with 7M unique values. 
The test was executed multiple times and the last (and thus fully warmed) 
results were used. Only searches with facet results were counted.

Results for this particular test were median response times of 36ms for 
standard facet counting and 24ms for sparse. See the attached PNG for graphs. 
Your Mileage Will Vary.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.6.1, 4.7
>Reporter: Toke Eskildsen
>Priority: Minor
> Fix For: 4.6.1
>
> Attachments: SOLR-5894.patch, 
> author_7M_tags_1852_logged_queries_warmed.png
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr (a proof of concept patch 
> will be provided shortly).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-03-21 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated SOLR-5894:
-

Attachment: SOLR-5894.patch

Simple proof of concept patch for sparse counting for field-based faceting on 
high-cardinality multi-value fields.

Apply patch, build Solr, open an index with a proper high-cardinality field (> 
100.000 unique values) such as author or title. Standard field based faceting 
works as always.

Activate sparse counting by appending facet.sparse=true to the request. Control 
the amount of extra space used for the sparse structure with 
facet.sparse.fraction (default is 0.025, which seems to work well).

Tested against Solr 4.6.1. Not tested against 4.7.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.6.1, 4.7
>Reporter: Toke Eskildsen
>Priority: Minor
> Fix For: 4.6.1
>
> Attachments: SOLR-5894.patch
>
>
> Field based faceting in Solr has two phases: Collecting counts for tags in 
> facets and extracting the requested tags.
> The execution time for the collecting phase is approximately linear to the 
> number of hits and the number of references from hits to tags. This phase is 
> not the focus here.
> The extraction time scales with the number of unique tags in the search 
> result, but is also heavily influenced by the total number of unique tags in 
> the facet as every counter, 0 or not, is visited by the extractor (at least 
> for count order). For fields with millions of unique tag values this means 
> 10s of milliseconds added to the minimum response time (see 
> https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
>  for a test on a corpus with 7M unique values in the facet).
> The extractor needs to visit every counter due to the current counter 
> structure being a plain int-array of size #unique_tags. Switching to a sparse 
> structure, where only the tag counters > 0 are visited, makes the extraction 
> time linear to the number of unique tags in the result set.
> Unfortunately the number of unique tags in the result set is unknown at 
> collect time, so it is not possible to reliably select sparse counting vs. 
> full counting up front. Luckily there exists solutions for sparse sets that 
> has the property of switching to non-sparse-mode without a switch-penalty, 
> when the sparse-threshold is exceeded (see 
> http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
> JIRA aims to implement this functionality in Solr (a proof of concept patch 
> will be provided shortly).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org