[ 
https://issues.apache.org/jira/browse/SOLR-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403331#comment-13403331
 ] 

Chris Russell edited comment on SOLR-3583 at 6/28/12 6:29 PM:
--------------------------------------------------------------

This patch builds upon the distributed pivot facets introduced in SOLR-2894 and 
adds the ability to request rudimentary percentiles when faceting.  The 
percentiles are calculated by using range facets to create "buckets" which 
divide up the field in question.  A range facet is done on each bucket to 
determine the number of documents whose value falls within that bucket.  An 
average value for each bucket is determined by averaging the upper and lower 
bounds of that bucket (its midpoint).  The document count and midpoint of each 
bucket are used when determining percentiles, with the midpoint of the 
matching bucket being returned as the percentile value.  Thus the accuracy of 
the value is determined by bucket size: a reported value can differ from the 
exact percentile by roughly half a bucket width.  Smaller buckets will yield 
more accurate values but will be more computationally intensive.  
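
To make the bucket approach concrete, here is a minimal Python sketch (not the 
patch's Java code).  The function name and the rule for picking the matching 
bucket (the first non-empty bucket whose cumulative count reaches p% of all 
counted documents) are assumptions made for illustration; the patch may select 
the bucket differently.

```python
from typing import Dict, Sequence, Tuple

# Each bucket is (lower_bound, upper_bound, doc_count), in ascending order.
Bucket = Tuple[float, float, int]


def bucket_percentiles(buckets: Sequence[Bucket],
                       requested: Sequence[float]) -> Dict[float, float]:
    """Approximate percentiles from range-facet style bucket counts.

    Hypothetical helper: the value reported for a percentile is the
    midpoint of the bucket that the percentile falls into.
    """
    total = sum(count for _, _, count in buckets)
    results: Dict[float, float] = {}
    if total == 0:
        return results  # nothing was counted, so nothing to report

    for p in requested:
        # Assumed rank rule: the document count that must be reached.
        target = p * total / 100.0
        running = 0
        for lower, upper, count in buckets:
            running += count
            if count > 0 and running >= target:
                results[p] = (lower + upper) / 2.0  # bucket midpoint
                break
    return results


example_buckets = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
print(bucket_percentiles(example_buckets, [25, 50, 75]))
```

With a gap of 10, every reported value is one of 5, 15, or 25, which is why a 
smaller gap gives finer-grained answers at the cost of more buckets to count.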

The choice to use buckets and have "fuzzy" values was made because 1) we were 
already using query facets to do this and wanted a solution that required 
fewer queries, and 2) our use case involves document counts on the order of 
tens of millions, and coalescing distinct values across shards during 
distributed search seemed problematic from a performance standpoint.

Usage:
  Querying:
  Faceting must be enabled (facet=true).  Then you may use the following 
parameters to define your percentiles request:
  percentiles=true : enables facet statistics
  percentiles.field=fieldname : field to calculate percentiles for; can be 
specified more than once
  percentiles.requested.percentiles=25,50,75 : requested percentiles, e.g. 
25th, 50th, 75th
  percentiles.lower.fence=0 : lower bound for percentiles calculation, i.e. 
lower edge of first bucket
  percentiles.upper.fence=5000 : upper bound for percentiles calculation, i.e. 
upper edge of last bucket
  percentiles.gap=10 : bucket size, e.g. bucket1 0-10, bucket2 10-20, etc. 
(double counting on edges is avoided, as with range facets)
  percentiles.averages=true : set this if you would like the average and doc 
count reported for each field (the average is a weighted average of bucket 
midpoints)
  facet.pivot=field1,field2 : if you ask for pivots, percentiles will show up 
on a per-pivot basis!
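
As an illustration of what percentiles.averages reports, the following Python 
sketch computes a weighted average of bucket midpoints together with the doc 
count.  The function name is hypothetical, and returning 0.0 when no documents 
were counted is an assumption, not something the patch description states.

```python
from typing import Sequence, Tuple

# Each bucket is (lower_bound, upper_bound, doc_count).
Bucket = Tuple[float, float, int]


def weighted_bucket_average(buckets: Sequence[Bucket]) -> Tuple[float, int]:
    """Return (average, doc_count), weighting each midpoint by its count."""
    total = sum(count for _, _, count in buckets)
    if total == 0:
        return 0.0, 0  # assumed behaviour for an empty result
    weighted_sum = sum((lower + upper) / 2.0 * count
                       for lower, upper, count in buckets)
    return weighted_sum / total, total


average, doc_count = weighted_bucket_average(
    [(0, 10, 2), (10, 20, 5), (20, 30, 3)])
print(average, doc_count)
```

Like the percentiles, this average is approximate: every document in a bucket 
is treated as if its value were the bucket midpoint.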

Here is an example URL using the example documents included with Solr:
http://localhost:8983/solr/select?q=*%3A*&start=0&rows=3&wt=xml&facet=true&percentiles=true&percentiles.field=popularity&percentiles.requested.percentiles=25,50,75&percentiles.averages=true&facet.field=price&facet.field=popularity&facet.pivot=manufacturedate_dt&f.popularity.percentiles.lower.fence=0&f.popularity.percentiles.upper.fence=11&f.popularity.percentiles.gap=1&facet.sort=index&percentiles.field=price&percentiles.lower.fence=0&percentiles.upper.fence=5000&percentiles.gap=10
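
For readers who prefer to assemble the request in code, this Python sketch 
rebuilds the same query with the standard library.  The parameter names and 
values are copied from the URL above, including the per-field 
f.popularity.percentiles.* form it uses; the variable names are illustrative. 
The encoded string differs cosmetically from the hand-written URL (for example, 
commas are percent-encoded) but decodes to the same parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# A list of pairs (not a dict) because some parameters repeat.
params = [
    ("q", "*:*"),
    ("start", 0),
    ("rows", 3),
    ("wt", "xml"),
    ("facet", "true"),
    ("percentiles", "true"),
    ("percentiles.field", "popularity"),
    ("percentiles.requested.percentiles", "25,50,75"),
    ("percentiles.averages", "true"),
    ("facet.field", "price"),
    ("facet.field", "popularity"),
    ("facet.pivot", "manufacturedate_dt"),
    # Per-field settings for popularity, as in the example URL.
    ("f.popularity.percentiles.lower.fence", 0),
    ("f.popularity.percentiles.upper.fence", 11),
    ("f.popularity.percentiles.gap", 1),
    ("facet.sort", "index"),
    ("percentiles.field", "price"),
    # Settings without a field prefix, as in the example URL.
    ("percentiles.lower.fence", 0),
    ("percentiles.upper.fence", 5000),
    ("percentiles.gap", 10),
]

url = "http://localhost:8983/solr/select?" + urlencode(params)

# Round-trip check: decode the query string back into its parameters.
decoded = parse_qs(urlsplit(url).query)
print(url)
```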
  

Results format:
  If percentiles are requested, the "facet_statistics" node will appear under 
"facet_counts".  Each field requested will have its own subsection.  Each 
subsection will contain percentiles and, optionally, average and count.
  If pivot facets are also requested, each level of pivot will have a 
"statistics" section that will contain per-field info similar to that found in 
"facet_statistics" above.

Notes:
  All field types that range facets support are supported; however, the 
average on a date field will always be returned as 0. Apologies.
  Works in distributed mode!
  Includes a unit test.
  If you're curious about which settings are used internally for the range 
faceting, they are:
        rangeHardEnd = false;
        includeLower = true;
        includeUpper = false;
        includeEdge = false;
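
As a rough illustration of what these settings imply for bucketing, here is a 
Python sketch (not the patch's code; both function names are hypothetical).  
It assumes the settings carry the usual range-facet meaning: each bucket 
includes its lower bound and excludes its upper bound, and with rangeHardEnd 
false the last bucket stays a full gap wide even if that pushes it past the 
upper fence.

```python
import math
from typing import List, Optional, Tuple


def bucket_edges(lower_fence: float, upper_fence: float,
                 gap: float) -> List[Tuple[float, float]]:
    """List (lower, upper) bounds for every bucket.

    Without a hard end, the final bucket is a full gap wide, so its
    upper bound may exceed upper_fence.
    """
    if gap <= 0 or upper_fence <= lower_fence:
        raise ValueError("need gap > 0 and upper_fence > lower_fence")
    n = math.ceil((upper_fence - lower_fence) / gap)
    return [(lower_fence + i * gap, lower_fence + (i + 1) * gap)
            for i in range(n)]


def bucket_index(value: float, lower_fence: float, upper_fence: float,
                 gap: float) -> Optional[int]:
    """Index of the bucket holding value, or None if it falls outside.

    Lower bounds are inclusive and upper bounds exclusive, so a value
    sitting exactly on an edge is counted once, in the higher bucket.
    """
    n = len(bucket_edges(lower_fence, upper_fence, gap))
    if value < lower_fence:
        return None
    i = int((value - lower_fence) // gap)
    return i if i < n else None


print(bucket_edges(0, 25, 10))
print(bucket_index(10, 0, 25, 10))
```

Under these assumptions, a value exactly equal to the upper edge of the last 
bucket is not counted, so the upper fence should sit above the largest value 
of interest.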
        
                
> Percentiles for facets, pivot facets, and distributed pivot facets
> ------------------------------------------------------------------
>
>                 Key: SOLR-3583
>                 URL: https://issues.apache.org/jira/browse/SOLR-3583
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Chris Russell
>            Priority: Minor
>              Labels: newbie, patch
>             Fix For: 4.0
>
>         Attachments: SOLR-3583.patch
>
>
> Built on top of SOLR-2894 (includes Apr 25th version) this patch adds 
> percentiles and averages to facets, pivot facets, and distributed pivot 
> facets by making use of range facet internals.  
