[ 
https://issues.apache.org/jira/browse/SOLR-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528822#comment-14528822
 ] 

Hoss Man commented on SOLR-7461:
--------------------------------

James: sorry, i overlooked your comments until now...

bq. Regarding the hyperloglog approach:

please raise HLL related discussions in the comments of SOLR-6968 - been making 
lots of progress there, would love your opinions & help testing, but i don't 
want to have the conversation fractured in two diff places for other people 
watching that issue.

----

bq. Unfortunately, for me, being able to get exact (or very nearly exact) 
distinct counts is important for the application I'm building (it's an 
analytics / stats tool so correct numbers are a must have feature)

I understand ... the reason i popped back up in this issue today was to comment 
that: 

1) as I work more with HLL and really think about the internals, i realized 
that for people who care a lot about absolutely accurate precision, especially 
in low cardinality situations, i think you are correct -- fixing this issue 
(countdistinct w/o distinctValues) would be a big win -- even once HLL support 
is added.  The simple reason being that HLL is fundamentally based on counting 
unique *hashed* values, so even if someone tries to tune it to be as accurate 
as possible, there's still the risk that two distinct values in a users's set 
will have the same hash value and prevent 100% accurate cardinality results.
2) as i worked on refactoring the tests in SOLR-6968 it occurred to me that 
what you're requesting here would actually be fairly easy to implement and 
support in a backcompat way -- in particular because of the changes already 
made in SOLR-6349 (distinguishing per-shard dependent stats) and the fact that 
the current param is "calcdistinct" but the stats currently returned are 
"countDistinct" and "distinctValues"

In a nutshell, what i think we should do is...

* replace {{calcdistinct(true)}} value in the {{Stat}} enum with 
{{distinctvals(true)}} and {{countdistinct(false,distinctvals)}} (so 
countdistinct indicates that in a distributed setup it requires distinctvals to 
be returned by each shard)
* remove all the calcDistinct logic from {{StatsValuesFactory}} and instead 
treat {{countdistinct}} and {{distinctvals}} as independently _requested_ stats 
that have a dependent distributed computation relationship (just like "mean" vs 
"sum / count")
* update the existing special syntactic sugar parsing for the top level 
"calcdistinct" param to also include syntatic sugar parsing for the 
calcdistinct local param - in both cases they become sugar for 
{{countdistinct=true distinctvals=true}}

the end result being that users who want the {{countDistinct}} response info, 
w/o the list of all {{distinctVals}} would just specify {{countdistinct=true}} 
as a localparam (no other new top level params needed)

How does that sound?

these changes should already be fairly trivial to make if you want to take a 
stab at a patch -- but there may be some patch collision in the code with the 
work i'm doing in SOLR-6968. There will definitely be patch collisions in the 
existing test cases, but once SOLR-6968 is commited the test changes should 
actually be _much_ easier ... and i'm happy to try and tackle this as soon as 
i'm done with that one (if nothing else, it will help me do a better apples to 
apples benchmark of HLL vs countDistinct w/o the added distincatVals in the 
final result)


> StatsComponent, calcdistinct, ability to disable distinctValues while keeping 
> countDistinct
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7461
>                 URL: https://issues.apache.org/jira/browse/SOLR-7461
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: James Andres
>              Labels: statscomponent
>
> When using calcdistinct with large amounts of data the distinctValues field 
> can be extremely large. In cases where the countDistinct is only required it 
> would be helpful if the server did not return distinctValues in the response.
> I'm no expert, but here are some ideas for how the syntax could look.
> {code}
> # Both countDistinct and distinctValues are returned, along with all other 
> stats
> stats.calcdistinct=true&stats.field=myfield
> # Only countDistinct and distinctValues are returned
> stats.calcdistinct=true&stats.field={!countDistinct=true 
> distinctValues=true}myfield
> # Only countDistinct is returned
> stats.calcdistinct=true&stats.field={!countDistinct=true}myfield
> # Only distinctValues is returned
> stats.calcdistinct=true&stats.field={!distinctValues=true}myfield
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to