[jira] [Updated] (SOLR-6968) add hyperloglog in statscomponent as an approximate count

Hoss Man (JIRA) Fri, 01 May 2015 09:50:42 -0700

     [ https://issues.apache.org/jira/browse/SOLR-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hoss Man updated SOLR-6968:
---------------------------
    Attachment: SOLR-6968.patch

Updated patch with more tests.

My current TODO list...

{noformat}
 - 6 < regwidth makes no sense? 
   - even at (min) log2m==4, isn't regwidth==6 big enough for all possible (hashed) long values?

 - prehashed support
   - need to sanity/error check that the field is a long
   - add an update processor to make this easy to do at index time
 - tuning knobs
   - memory vs accuracy (log2m)
     - idea: (least ram) 0 < accuracy < 1 (most accurate)
       - scale 
   - max estimable cardinality (regwidth)
     - perhaps hardcode regwidth==6 ? expert only option to adjust?
     - pick regwidth based on field type? (int/enum have fewer in general)
     - pick regwidth based on index stats? max out based on total terms in field?
       - or for single valued fields: max out based on numDocs
       - HLL must use same hash seed, but does it support union when log2m and regwidth are diff?
 - convenience equivalence with countDistinct in solrj response obj ?
{noformat}
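
For reference while weighing the log2m and regwidth knobs above, here is a rough Python sketch of the usual HyperLogLog arithmetic. It is not taken from the patch. It assumes 64-bit hashes, the textbook relative error of about 1.04/sqrt(m) for m = 2^log2m registers, and a register that stores the 1-based position of the first set bit among the hash bits left over after log2m bits select the register. The HLL library actually used may count or cap register values differently, and all function names here are made up for illustration.

```python
import math

HASH_BITS = 64  # assumption: values are hashed to 64-bit longs


def register_count(log2m: int) -> int:
    """Number of registers m = 2^log2m."""
    return 1 << log2m


def relative_std_error(log2m: int) -> float:
    """Textbook HLL relative standard error, roughly 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(register_count(log2m))


def dense_size_bytes(log2m: int, regwidth: int) -> int:
    """Bytes for a fully dense register array (ignores any header/overhead)."""
    return math.ceil(register_count(log2m) * regwidth / 8)


def max_needed_register_value(log2m: int) -> int:
    """Largest rank a register could ever need to hold.

    Assumption: log2m bits pick the register and the rank is the 1-based
    position of the first set bit in the remaining bits, so an all-zero
    remainder yields (HASH_BITS - log2m) + 1.
    """
    return HASH_BITS - log2m + 1


def regwidth_is_sufficient(log2m: int, regwidth: int) -> bool:
    """True if a regwidth-bit register can hold every possible rank."""
    return (1 << regwidth) - 1 >= max_needed_register_value(log2m)


if __name__ == "__main__":
    for log2m in (4, 10, 13, 16):
        print(
            f"log2m={log2m:2d} "
            f"err~{relative_std_error(log2m):.4f} "
            f"bytes(regwidth=6)={dense_size_bytes(log2m, 6)} "
            f"regwidth=6 enough? {regwidth_is_sufficient(log2m, 6)}"
        )
```

Under these assumptions, log2m==4 leaves 60 hash bits, so the largest rank is 61, which fits in a 6-bit register (max 63) but not a 5-bit one (max 31). That is consistent with the first question in the list, with the caveat that it depends on the rank convention assumed here.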

> add hyperloglog in statscomponent as an approximate count
> ---------------------------------------------------------
>
>                 Key: SOLR-6968
>                 URL: https://issues.apache.org/jira/browse/SOLR-6968
>             Project: Solr
>          Issue Type: Sub-task
>            Reporter: Hoss Man
>         Attachments: SOLR-6968.patch, SOLR-6968.patch, SOLR-6968.patch
>
>
> The stats component currently supports "calcDistinct", but it's terribly 
> inefficient -- especially in distributed mode.
> We should add support for using HyperLogLog to compute an approximate count 
> of distinct values (using localparams via SOLR-6349 to control the precision 
> of the approximation).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
