[ https://issues.apache.org/jira/browse/SOLR-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man updated SOLR-6968:
---------------------------
    Attachment: SOLR-6968.patch

Updated patch with more tests.

My current TODO list...

{noformat}
 - 6 < regwidth makes no sense?
   - even at (min) log2m==4, isn't regwidth==6 big enough for all possible (hashed) long values?
 - prehashed support
   - need to sanity/error check that the field is a long
   - add an update processor to make this easy to do at index time
 - tuning knobs
   - memory vs accuracy (log2m)
     - idea: (least ram) 0 < accuracy < 1 (most accurate)
   - scale
     - max cardinality estimable (regwidth)
     - perhaps hardcode regwidth==6 ? expert only option to adjust?
     - pick regwidth based on field type? (int/enum have fewer in general)
     - pick regwidth based on index stats? max out based on total terms in field?
       - or for single valued fields: max out based on numDocs
   - HLL must use same hash seed, but does it support union when log2m and regwidth differ?
 - convenience equivalence with countDistinct in solrj response obj ?
{noformat}

> add hyperloglog in statscomponent as an approximate count
> ---------------------------------------------------------
>
>                 Key: SOLR-6968
>                 URL: https://issues.apache.org/jira/browse/SOLR-6968
>             Project: Solr
>          Issue Type: Sub-task
>            Reporter: Hoss Man
>         Attachments: SOLR-6968.patch, SOLR-6968.patch, SOLR-6968.patch
>
>
> stats component currently supports "calcDistinct" but it's terribly
> inefficient -- especially in distributed mode.
> we should add support for using hyperloglog to compute an approximate count
> of distinct values (using localparams via SOLR-6349 to control the precision
> of the approximation)
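To make the "memory vs accuracy (log2m)" item in the TODO list above concrete, here is a rough sketch, not taken from the patch. It assumes a dense layout of 2^log2m registers of regwidth bits each and uses the usual HyperLogLog estimate of about 1.04/sqrt(m) for the relative standard error. The function names, and the linear mapping from a 0 < accuracy < 1 knob onto a log2m range (including the upper bound of 18), are hypothetical and only illustrate the idea.

```python
import math


def hll_tradeoff(log2m, regwidth):
    """Return (dense register bytes, approximate relative standard error).

    Assumes m = 2**log2m registers, each regwidth bits wide, and the
    common HyperLogLog error estimate of roughly 1.04 / sqrt(m).
    """
    m = 1 << log2m
    dense_bytes = math.ceil(m * regwidth / 8)
    std_error = 1.04 / math.sqrt(m)
    return dense_bytes, std_error


def log2m_for_accuracy(accuracy, min_log2m=4, max_log2m=18):
    """Hypothetical knob: map 0 < accuracy < 1 linearly onto a log2m range.

    The bounds are illustrative assumptions, not values from the patch.
    """
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return min_log2m + round(accuracy * (max_log2m - min_log2m))


if __name__ == "__main__":
    for accuracy in (0.1, 0.5, 0.9):
        log2m = log2m_for_accuracy(accuracy)
        size, err = hll_tradeoff(log2m, regwidth=6)
        print(f"accuracy={accuracy} log2m={log2m} bytes={size} error~{err:.4f}")
```

Under these assumptions each increment of log2m doubles the register memory and shrinks the expected error by a factor of about 1.41, which is why a single accuracy knob could plausibly drive log2m alone while regwidth is chosen separately.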