[jira] [Updated] (SOLR-6968) add hyperloglog in statscomponent as an approximate count

Hoss Man (JIRA) Tue, 28 Apr 2015 18:39:20 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hoss Man updated SOLR-6968:
---------------------------
    Attachment: SOLR-6968.patch


really simple straw man implementation using java-hll...

https://github.com/aggregateknowledge/java-hll

The bulk of the current patch is in test refactoring because all the special 
case conditionals in StatsComponentTest.testIndividualStatLocalParams were 
driving me insane.

Currently only cardinality of numeric fields is supported (and even then, only 
long fields really work "correctly").  Current syntax is...

{noformat}
/select?q=*:*&stats=true&stats.field={!cardinality=true}fieldname_l
{noformat}

...but i'm thinking that should change ... there are at least two types of knobs 
we should support, i'm just not sure which is more important, or if either 
should be mandatory:
* An indication of whether or not the input is already hashed
** reading up more on HLL i'm realizing how important it is that the values be 
hashed (into longs).
** We should certainly support on the fly hashing, but for people who plan to 
compute cardinalities a lot, particularly over large sets or strings, we should 
also have both:
*** an easy way for them to compute those long hashes at index time (simple 
UpdateProcessor)
*** a stats localparam to indicate that the field they are computing cardinality 
over is already hashed
* precision / size tuning
** similar to how we have an optional "tdigestCompression" param we could have 
an "hllOptions" param for overriding the "log2m" and "regwidth" options
** or we could require that the value of the "cardinality" param be a value 
indicating how much the user cares about accuracy vs ram (ie: a float between 0 
and 1 indicating min ram vs max accuracy) and compute log2m+regwidth from that 
("false" or negative values could disable it completely, while "true" could be 
shorthand for some default)
*** this would have the benefit of being something we could continue to support 
even if a better cardinality algorithm comes along in the future
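
To make the two knobs above concrete, here is a rough sketch (not code from the patch). Every name in it (CardinalityKnobs, hashToLong, optionsFor) is hypothetical, and the log2m/regwidth bounds are assumptions picked for illustration that should be checked against the java-hll docs. FNV-1a is swapped in only to keep the sketch dependency-free; a stronger hash such as MurmurHash3 is probably the better choice for HLL input. The linear mapping from the 0-1 value is just one possible way to turn "min ram vs max accuracy" into concrete options.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helpers for illustration; none of these names exist in the patch. */
final class CardinalityKnobs {
    // Assumed option bounds (illustrative only; verify against the java-hll docs).
    static final int MIN_LOG2M = 4;
    static final int MAX_LOG2M = 30;
    static final int MIN_REGWIDTH = 1;
    static final int MAX_REGWIDTH = 8;

    /** On-the-fly hashing of a string value into a long using 64-bit FNV-1a. */
    static long hashToLong(String value) {
        long h = 0xcbf29ce484222325L;          // FNV-1a 64-bit offset basis
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);                   // mix in one byte
            h *= 0x100000001b3L;               // FNV 64-bit prime (wraps mod 2^64)
        }
        return h;
    }

    /**
     * Linearly maps an accuracy preference in [0,1] (0 = min ram, 1 = max accuracy)
     * to a {log2m, regwidth} pair within the assumed bounds.
     */
    static int[] optionsFor(double accuracy) {
        if (Double.isNaN(accuracy) || accuracy < 0.0 || accuracy > 1.0) {
            throw new IllegalArgumentException("accuracy must be between 0 and 1: " + accuracy);
        }
        int log2m = MIN_LOG2M + (int) Math.round(accuracy * (MAX_LOG2M - MIN_LOG2M));
        int regwidth = MIN_REGWIDTH + (int) Math.round(accuracy * (MAX_REGWIDTH - MIN_REGWIDTH));
        return new int[] { log2m, regwidth };
    }

    public static void main(String[] args) {
        int[] opts = optionsFor(0.5);
        System.out.println("log2m=" + opts[0] + " regwidth=" + opts[1]);
        System.out.println("hash=" + Long.toHexString(hashToLong("some string value")));
    }
}
```

With something like this, a field that is already hashed at index time would skip hashToLong entirely, and the resulting pair could be handed to whatever HLL implementation is in use. That is also why the 0-1 style param would survive a future change of algorithm.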

My next steps are to focus on more concrete tests & then refactoring to work 
with other field types, and think about the knobs/configuration as i go.

> add hyperloglog in statscomponent as an approximate count
> ---------------------------------------------------------
>
>                 Key: SOLR-6968
>                 URL: https://issues.apache.org/jira/browse/SOLR-6968
>             Project: Solr
>          Issue Type: Sub-task
>            Reporter: Hoss Man
>         Attachments: SOLR-6968.patch
>
>
> stats component currently supports "calcDistinct" but it's terribly 
> inefficient -- especially in distrib mode.
> we should add support for using hyperloglog to compute an approximate count 
> of distinct values (using localparams via SOLR-6349 to control the precision 
> of the approximation)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
