[ https://issues.apache.org/jira/browse/SOLR-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528822#comment-14528822 ]
Hoss Man commented on SOLR-7461: -------------------------------- James: sorry, i overlooked your comments until now... bq. Regarding the hyperloglog approach: please raise HLL related discussions in the comments of SOLR-6968 - been making lots of progress there, would love your opinions & help testing, but i don't want to have the conversation fractured in two diff places for other people watching that issue. ---- bq. Unfortunately, for me, being able to get exact (or very nearly exact) distinct counts is important for the application I'm building (it's an analytics / stats tool so correct numbers are a must have feature) I understand ... the reason i popped back up in this issue today was to comment that: 1) as I work more with HLL and really think about the internals, i realized that for people who care a lot about absolutely accurate precision, especially in low cardinality situations, i think you are correct -- fixing this issue (countdistinct w/o distinctValues) would be a big win -- even once HLL support is added. The simple reason being that HLL is fundamentally based on counting unique *hashed* values, so even if someone tries to tune it to be as accurate as possible, there's still the risk that two distinct values in a users's set will have the same hash value and prevent 100% accurate cardinality results. 2) as i worked on refactoring the tests in SOLR-6968 it occurred to me that what you're requesting here would actually be fairly easy to implement and support in a backcompat way -- in particular because of the changes already made in SOLR-6349 (distinguishing per-shard dependent stats) and the fact that the current param is "calcdistinct" but the stats currently returned are "countDistinct" and "distinctValues" In a nutshell, what i think we should do is... * replace {{calcdistinct(true)}} value in the {{Stat}} enum with {{distinctvals(true)}} and {{countdistinct(false,distinctvals)}} (so countdistinct indicates that in a distributed setup it requires distinctvals to be returned by each shard) * remove all the calcDistinct logic from {{StatsValuesFactory}} and instead treat {{countdistinct}} and {{distinctvals}} as independently _requested_ stats that have a dependent distributed computation relationship (just like "mean" vs "sum / count") * update the existing special syntactic sugar parsing for the top level "calcdistinct" param to also include syntatic sugar parsing for the calcdistinct local param - in both cases they become sugar for {{countdistinct=true distinctvals=true}} the end result being that users who want the {{countDistinct}} response info, w/o the list of all {{distinctVals}} would just specify {{countdistinct=true}} as a localparam (no other new top level params needed) How does that sound? these changes should already be fairly trivial to make if you want to take a stab at a patch -- but there may be some patch collision in the code with the work i'm doing in SOLR-6968. There will definitely be patch collisions in the existing test cases, but once SOLR-6968 is commited the test changes should actually be _much_ easier ... and i'm happy to try and tackle this as soon as i'm done with that one (if nothing else, it will help me do a better apples to apples benchmark of HLL vs countDistinct w/o the added distincatVals in the final result) > StatsComponent, calcdistinct, ability to disable distinctValues while keeping > countDistinct > ------------------------------------------------------------------------------------------- > > Key: SOLR-7461 > URL: https://issues.apache.org/jira/browse/SOLR-7461 > Project: Solr > Issue Type: Improvement > Reporter: James Andres > Labels: statscomponent > > When using calcdistinct with large amounts of data the distinctValues field > can be extremely large. In cases where the countDistinct is only required it > would be helpful if the server did not return distinctValues in the response. > I'm no expert, but here are some ideas for how the syntax could look. > {code} > # Both countDistinct and distinctValues are returned, along with all other > stats > stats.calcdistinct=true&stats.field=myfield > # Only countDistinct and distinctValues are returned > stats.calcdistinct=true&stats.field={!countDistinct=true > distinctValues=true}myfield > # Only countDistinct is returned > stats.calcdistinct=true&stats.field={!countDistinct=true}myfield > # Only distinctValues is returned > stats.calcdistinct=true&stats.field={!distinctValues=true}myfield > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org