Thanks for your quick reply!

On Tue, Nov 25, 2014 at 6:41 PM, Adrien Grand <
adrien.gr...@elasticsearch.com> wrote:

> On Tue, Nov 25, 2014 at 2:29 PM, Dror Atariah <dror...@gmail.com> wrote:
>
>> 1) For me, the documentation is still somehow confusing, and the
>> difference between the *cardinality* and *value_count* aggregations is
>> not 100% clear.
>>
>
> I have to agree here... If you have suggestions to make it less confusing,
> ideas are highly welcome (even changing the name of the aggs might be an
> option if we do it in a major release).
>
Well, changing the names is problematic because of backwards compatibility
and should be done only as a last resort. Before that, I'd suggest adding
a section, common to the two aggregations, with a *single (minimal)*
example that demonstrates the differences.
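
As a minimal sketch of what such an example could show: the documents and the "color" field below are made up for illustration, and the two counts are computed locally in Python to mirror what each aggregation is meant to return. Nothing is sent to ES, and for large sets the real cardinality result would be approximate.

```python
import json

# Hypothetical documents; the "color" field is invented for this illustration.
docs = [
    {"color": "red"},
    {"color": "red"},
    {"color": "blue"},
    {"color": "green"},
    {"color": "blue"},
]

# Search body requesting both aggregations on the same field.
# The aggregation names ("color_value_count", "color_cardinality") are arbitrary.
request_body = {
    "size": 0,
    "aggs": {
        "color_value_count": {"value_count": {"field": "color"}},
        "color_cardinality": {"cardinality": {"field": "color"}},
    },
}

values = [d["color"] for d in docs]

# value_count: how many values were seen, duplicates included.
expected_value_count = len(values)

# cardinality: how many distinct values were seen
# (computed exactly here; ES approximates this for large sets).
expected_cardinality = len(set(values))

print(json.dumps(request_body, indent=2))
print("value_count:", expected_value_count)
print("cardinality:", expected_cardinality)
```

With five values of which three are distinct, the two numbers differ, which is the whole point such a shared documentation section would need to make.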


>> 2) When it comes to counting unique values: I believe that the only way
>> that one can take, at the moment, is to use the *cardinality* aggregation.
>> This, however, comes with the price of an approximated result (as discussed
>> in the documentation and in the paper describing HyperLogLog++). I
>> understand the need to take an approximating approach; but I think that the
>> returned result should indicate a bound on the error. Otherwise, the
>> returned count could be considered useless. The documentation mentions a
>> figure of 5%: is it independent of the cardinality? What happens to this
>> bound when the precision threshold is much larger than 40,000?
>>
>
> This is true: only the cardinality aggregation can compute unique
> counts.
>
> The thing about the error is that there is no bound on it, but higher
> errors are less likely. The only thing we *might* be able to return would
> be a confidence interval, but it requires some work... Regarding the 5%
> that is mentioned in the documentation, it was just meant as an example to
> show that in spite of the approximate approach, results are very close to
> accurate. A precision_threshold above 40000 is basically the same as a
> precision_threshold of 40000.
>

If there's no theoretical bound, then I guess the best one can hope for is
the probability that the returned value falls outside an \epsilon interval
around the true count (what you probably refer to as a "confidence
interval"). This would be great, not to say absolutely necessary. When a
data scientist presents his work, the business (narrow minded) guys want to
know the numbers... :)

Furthermore, since ES aims for big data, it is not clear to me where the
40,000 figure comes from. After all, whether the number of unique values
is on the order of 1K or 100M, the threshold cannot be the same... can
it?
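
A rough sketch of the kind of figure meant here, under stated assumptions: it uses the classic HyperLogLog analysis, in which the relative standard error is about 1.04/sqrt(m) for m = 2^p registers and does not depend on the cardinality, and it additionally assumes the relative error is approximately normal. How precision_threshold maps to p inside ES is not taken from this thread, so p is a free parameter, and the small-cardinality behaviour of HyperLogLog++ is not modelled.

```python
import math


def hll_relative_std_error(p: int) -> float:
    """Approximate relative standard error of plain HyperLogLog with 2**p registers."""
    m = 2 ** p
    return 1.04 / math.sqrt(m)


def prob_outside(epsilon: float, p: int) -> float:
    """Probability that the relative error exceeds epsilon in absolute value.

    Assumption: the relative error is normal with mean 0 and standard
    deviation hll_relative_std_error(p). This is an approximation,
    not a guaranteed bound.
    """
    sigma = hll_relative_std_error(p)
    # Two-sided normal tail: P(|Z| > epsilon / sigma) = erfc(epsilon / (sigma * sqrt(2)))
    return math.erfc(epsilon / (sigma * math.sqrt(2)))


# p is a hypothetical precision parameter, not an ES setting.
for p in (10, 12, 14):
    sigma = hll_relative_std_error(p)
    print(f"p={p}: std error ~ {sigma:.4%}, P(|error| > 5%) ~ {prob_outside(0.05, p):.3g}")
```

Under these assumptions the error scale depends only on the number of registers, not on whether there are 1K or 100M unique values, which is one way a single threshold could make sense.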

Best,
Dror

PS: thanks for the interesting discussion!



-- 
Dror Atariah, Ph.D.
de.linkedin.com/in/atariah

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CANfRcg3pHpTUMcbdgUuETk-LC0Z%3DOkA8fWzVh1BUZB7iULjH_w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
