Re: Accuracy on cardinality aggregate

2014-11-25 Thread Dror Atariah
Thanks for your quick reply!

On Tue, Nov 25, 2014 at 6:41 PM, Adrien Grand <
adrien.gr...@elasticsearch.com> wrote:

> On Tue, Nov 25, 2014 at 2:29 PM, Dror Atariah  wrote:
>
>> 1) For me, the documentation is still somewhat confusing, and the
>> difference between the *cardinality* and *value_count* aggregations is
>> not 100% clear.
>>
>
> I have to agree here... If you have suggestions to make it less confusing,
> ideas are highly welcome (even changing the name of the aggs might be an
> option if we do it in a major release).
>
Well, name changing is problematic due to backward compatibility, and
should be exercised only as a last resort. Instead, I'd suggest adding a
section, common to the two aggregations, with a *single (minimal)* example
that demonstrates the differences.
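Something along these lines, perhaps (a toy sketch of the two aggregations' semantics in plain Python, not an actual Elasticsearch request; the field name is made up):

```python
# Toy illustration: value_count vs. cardinality semantics.
docs = [
    {"screen_name": "alice"},
    {"screen_name": "bob"},
    {"screen_name": "alice"},  # duplicate value
]
values = [d["screen_name"] for d in docs]

value_count = len(values)       # every extracted value, duplicates included -> 3
cardinality = len(set(values))  # distinct values (ES only approximates this) -> 2
print(value_count, cardinality)
```

The same minimal dataset run through both aggregations would make the distinction obvious at a glance.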


> 2) When it comes to counting unique values: I believe that the only way
>> that one can take, at the moment, is to use the *cardinality* aggregation.
>> This, however, comes with the price of an approximated result (as discussed
>> in the documentation and in the paper describing HyperLogLog++). I
>> understand the need to take an approximating approach; but I think that the
>> returned result should indicate a bound on the error. Otherwise, the
>> returned count could be considered useless. In the documentation the figure
>> 5% is mentioned --- is it independent of the cardinality? what happens to
>> this bound when the precision threshold is >> 40,000?
>>
>
> This is true, only the cardinality aggregation allows computing unique
> counts.
>
> The thing about the error is that there is no bound on it, but higher
> errors are less likely. The only thing we *might* be able to return would
> be a confidence interval, but it requires some work... Regarding the 5%
> that is mentioned in the documentation, it was just meant as an example to
> show that in spite of the approximate approach, results are very close to
> accurate. A precision_threshold above 40,000 is basically the same as a
> precision_threshold of 40,000.
>

If there's no theoretical bound, then I guess the best one can hope for is
the probability that the returned value lies outside an \epsilon interval
(what you probably refer to as a "confidence interval"). This would be
great, not to say absolutely necessary. When a data scientist presents
their work, the (narrow-minded) business people want to know the numbers... :)

Furthermore, since ES aims at big data, it is not clear to me how one can
come up with the 40,000 figure. After all, if the number of unique values
is on the order of 1K or 100M, then the threshold cannot be the same... can
it?
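For what it's worth, the standard HyperLogLog analysis suggests the relative error really can be independent of the cardinality: the relative standard error is roughly 1.04/sqrt(m), where m is the number of registers. (How Elasticsearch maps precision_threshold to a register count is an internal detail I'm only assuming here.)

```python
import math

def hll_relative_std_error(m):
    """Asymptotic relative standard error of a HyperLogLog sketch with m
    registers, per the standard HLL analysis. Note it depends only on m,
    not on the true cardinality being estimated."""
    return 1.04 / math.sqrt(m)

# e.g. with 2**14 = 16384 registers, the typical error is under 1%
print(hll_relative_std_error(2 ** 14))
```

So a fixed sketch size buys a fixed *typical* relative error, whether the true count is 1K or 100M.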

Best,
Dror

PS: thanks for the interesting discussion!


>
> --
> Adrien Grand
>



-- 
Dror Atariah, Ph.D.
de.linkedin.com/in/atariah

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CANfRcg3pHpTUMcbdgUuETk-LC0Z%3DOkA8fWzVh1BUZB7iULjH_w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Accuracy on cardinality aggregate

2014-11-25 Thread Adrien Grand
Hi Dror,

On Tue, Nov 25, 2014 at 2:29 PM, Dror Atariah  wrote:

> Hi Adrien,
>
> I have two comments/questions:
>
> 1) For me, the documentation is still somewhat confusing, and the
> difference between the *cardinality* and *value_count* aggregations is
> not 100% clear.
>

I have to agree here... If you have suggestions to make it less confusing,
ideas are highly welcome (even changing the name of the aggs might be an
option if we do it in a major release).


> 2) When it comes to counting unique values: I believe that the only way
> that one can take, at the moment, is to use the *cardinality* aggregation.
> This, however, comes with the price of an approximated result (as discussed
> in the documentation and in the paper describing HyperLogLog++). I
> understand the need to take an approximating approach; but I think that the
> returned result should indicate a bound on the error. Otherwise, the
> returned count could be considered useless. In the documentation the figure
> 5% is mentioned --- is it independent of the cardinality? what happens to
> this bound when the precision threshold is >> 40,000?
>

This is true, only the cardinality aggregation allows computing unique
counts.

The thing about the error is that there is no bound on it, but higher
errors are less likely. The only thing we *might* be able to return would
be a confidence interval, but it requires some work... Regarding the 5%
that is mentioned in the documentation, it was just meant as an example to
show that in spite of the approximate approach, results are very close to
accurate. A precision_threshold above 40,000 is basically the same as a
precision_threshold of 40,000.
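To illustrate how an approximate counter can be "very close to accurate" without a hard error bound, here is a toy k-minimum-values sketch (a cousin of HyperLogLog, *not* what Elasticsearch uses): its estimate is usually, but not provably always, within a few percent.

```python
import hashlib

def kmv_estimate(values, k=256):
    """Toy k-minimum-values distinct-count estimate: hash each value to
    [0, 1) and infer the cardinality from the k-th smallest hash.
    Illustrative only -- Elasticsearch relies on HyperLogLog++ instead."""
    hashes = sorted({int(hashlib.md5(str(v).encode()).hexdigest(), 16) / 2**128
                     for v in values})
    if len(hashes) < k:
        return len(hashes)           # small cardinalities come out exact
    return int((k - 1) / hashes[k - 1])

exact = 100_000
estimate = kmv_estimate(range(exact))
print(estimate)  # typically within a few percent of 100000, but no hard bound
```

Note the same two regimes as precision_threshold: below k the count is exact, above it the count is a concentrated-but-unbounded estimate.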

-- 
Adrien Grand



Re: Accuracy on cardinality aggregate

2014-11-25 Thread Dror Atariah
Hi Adrien,

I have two comments/questions:

1) For me, the documentation is still somewhat confusing, and the difference 
between the *cardinality* and *value_count* aggregations is not 100% clear.

2) When it comes to counting unique values: I believe that the only way 
that one can take, at the moment, is to use the *cardinality* aggregation. 
This, however, comes with the price of an approximated result (as discussed 
in the documentation and in the paper describing HyperLogLog++). I 
understand the need to take an approximating approach; but I think that the 
returned result should indicate a bound on the error. Otherwise, the 
returned count could be considered useless. In the documentation the figure 
5% is mentioned --- is it independent of the cardinality? what happens to 
this bound when the precision threshold is >> 40,000?

Thanks for your time,
Dror

On Tuesday, April 1, 2014 9:50:30 AM UTC+2, Adrien Grand wrote:
>
> Hi Henrik,
>
> Indeed, there is no way to compute exact unique counts. The reason why we 
> don't expose such a feature is that it would be very costly. In your case, 
> the cardinality is not too large so the terms aggregation helped compute 
> the number of unique values but if the actual cardinality had been very 
> large (eg. 100M), it is very likely that trying to use the terms agg to do 
> so would have required a lot of memory (maybe triggering out-of-memory 
> errors on your nodes), been very slow and caused a lot of network traffic. 
> We will try to clarify this through documentation or a blog post soon.
>
> Thanks for trying out this new aggregation!
>
>
>
> On Mon, Mar 31, 2014 at 11:09 PM, Henrik Nordvik  > wrote:
>
>> Ah, so there is currently no easy way of getting exact unique counts out 
>> of elasticsearch?
>>
>> I found a manual way of doing it:
>>
>> curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{ 
>> "facets": { "a": {  "terms": { "field": "screen_name", "size": 
>> 20},"facet_filter": {"query": {"term": {"lang": "en"},"size": 0}' | 
>> ./jq '.facets.a.terms | length'
>> 145474 (vs 145541)
>> curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{ 
>> "facets": { "a": {  "terms": { "field": "screen_name", "size": 
>> 20},"facet_filter": {"query": {"term": {"lang": "ja"},"size": 0}' | 
>> ./jq '.facets.a.terms | length'
>> 50949 (vs 50824)
>>
>> So the count is quite close! Thank you.
>>
>>
>>
>> On Friday, March 28, 2014 10:32:55 PM UTC+1, Binh Ly wrote:
>>>
>>> value_count is the total number of values extracted per bucket. This 
>>> example might help:
>>>
>>> https://gist.github.com/bly2k/9843335
>>>
>
>
>
> -- 
> Adrien Grand
>  



Re: Accuracy on cardinality aggregate

2014-04-01 Thread Adrien Grand
Hi Henrik,

Indeed, there is no way to compute exact unique counts. The reason why we
don't expose such a feature is that it would be very costly. In your case,
the cardinality is not too large, so the terms aggregation helped compute
the number of unique values; but if the actual cardinality had been very
large (e.g. 100M), it is very likely that trying to use the terms agg to do
so would have required a lot of memory (maybe triggering out-of-memory
errors on your nodes), been very slow, and caused a lot of network traffic.
We will try to clarify this through documentation or a blog post soon.
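A rough way to see the memory side of that cost (plain Python, exact numbers are machine-dependent): an exact distinct count has to materialize every unique value, so memory grows with the cardinality, whereas a sketch stays at a fixed size.

```python
import sys

# Exact counting: keep every distinct value in memory.
unique_users = {f"user{i}" for i in range(1_000_000)}

# The set's internal hash table alone is tens of MB (before counting the
# strings it references), and it keeps growing with the cardinality.
set_bytes = sys.getsizeof(unique_users)
print(set_bytes)

# A HyperLogLog-style sketch, by contrast, is a fixed-size register array:
# e.g. 2**14 one-byte registers is about 16 KB regardless of cardinality.
sketch_bytes = 2 ** 14
print(set_bytes // sketch_bytes)  # orders of magnitude apart
```

Now multiply the exact-count structure across shards and the coordinating node, and the OOM/network concerns above follow directly.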

Thanks for trying out this new aggregation!



On Mon, Mar 31, 2014 at 11:09 PM, Henrik Nordvik  wrote:

> Ah, so there is currently no easy way of getting exact unique counts out
> of elasticsearch?
>
> I found a manual way of doing it:
>
> curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
> "facets": { "a": {  "terms": { "field": "screen_name", "size":
> 20},"facet_filter": {"query": {"term": {"lang": "en"},"size": 0}' |
> ./jq '.facets.a.terms | length'
> 145474 (vs 145541)
> curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
> "facets": { "a": {  "terms": { "field": "screen_name", "size":
> 20},"facet_filter": {"query": {"term": {"lang": "ja"},"size": 0}' |
> ./jq '.facets.a.terms | length'
> 50949 (vs 50824)
>
> So the count is quite close! Thank you.
>
>
>
> On Friday, March 28, 2014 10:32:55 PM UTC+1, Binh Ly wrote:
>>
>> value_count is the total number of values extracted per bucket. This
>> example might help:
>>
>> https://gist.github.com/bly2k/9843335
>>



-- 
Adrien Grand



Re: Accuracy on cardinality aggregate

2014-03-31 Thread Henrik Nordvik
Ah, so there is currently no easy way of getting exact unique counts out 
of elasticsearch?

I found a manual way of doing it:

curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
  "size": 0,
  "facets": {
    "a": {
      "terms": { "field": "screen_name", "size": 20 },
      "facet_filter": { "query": { "term": { "lang": "en" } } }
    }
  }
}' | ./jq '.facets.a.terms | length'
145474 (vs 145541)
curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
  "size": 0,
  "facets": {
    "a": {
      "terms": { "field": "screen_name", "size": 20 },
      "facet_filter": { "query": { "term": { "lang": "ja" } } }
    }
  }
}' | ./jq '.facets.a.terms | length'
50949 (vs 50824)

So the count is quite close! Thank you.


On Friday, March 28, 2014 10:32:55 PM UTC+1, Binh Ly wrote:
>
> value_count is the total number of values extracted per bucket. This 
> example might help:
>
> https://gist.github.com/bly2k/9843335
>



Re: Accuracy on cardinality aggregate

2014-03-28 Thread Binh Ly
value_count is the total number of values extracted per bucket. This 
example might help:

https://gist.github.com/bly2k/9843335
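In plain terms (a toy sketch, not the gist above): within each terms bucket, value_count tallies every extracted value, duplicates and all.

```python
# Toy illustration of value_count per terms bucket (plain Python).
tweets = [
    {"lang": "en", "screen_name": "alice"},
    {"lang": "en", "screen_name": "alice"},  # same user tweeting twice
    {"lang": "en", "screen_name": "bob"},
    {"lang": "ja", "screen_name": "kenji"},
]

buckets = {}
for t in tweets:
    buckets.setdefault(t["lang"], []).append(t["screen_name"])

value_count = {lang: len(names) for lang, names in buckets.items()}
print(value_count)  # {'en': 3, 'ja': 1} -- not 2 distinct users for 'en'
```

So a bucket's value_count can only ever equal the unique count when no document repeats a value.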



Re: Accuracy on cardinality aggregate

2014-03-28 Thread Henrik Nordvik
I compared the unique count with the total field of the old terms facet and
it matched. What else would the count be? It is lower than doc count.
On 28 Mar 2014 18:54, "Mark Harwood"  wrote:

> I don't believe value_count is intended to be a unique count.
>
>
>
> On Friday, March 28, 2014 7:17:47 AM UTC, Henrik Nordvik wrote:
>>
>> Hi,
>> I'm trying out the new cardinality aggregation, and want to measure the
>> accuracy on my data. I'm using a dataset of a day of sample tweets (2.8m
>> tweets).
>>
>> I'm counting the number of unique usernames per language.
>> To get my "reference" unique count I use this:
>> GET /twitter-2014.03.26/_search
>> {
>>   "size": 0,
>>   "aggs": {
>> "country_count": {
>>   "terms": {
>> "field": "lang"
>>   },
>>   "aggs": {
>>"unique_count" : { "value_count" : { "field" : "screen_name" } }
>>   }
>> }
>>   }
>> }
>>
>> Result:
>>   "aggregations": {
>>   "country_count": {
>>  "buckets": [
>> {
>>"key": "en",
>>"doc_count": 872906,
>>"unique_count": {
>>   "value": 307489
>>}
>> },
>> {
>>"key": "ja",
>>"doc_count": 581521,
>>"unique_count": {
>>   "value": 103035
>>}
>> },
>>
>>
>> To get the approximate count with cardinality:
>> GET /twitter-2014.03.26/_search
>> {
>>   "size": 0,
>>   "aggs": {
>> "country_count": {
>>   "terms": {
>> "field": "lang"
>>   },
>>   "aggregations": {
>> "distinct_users_approx": {
>>   "cardinality": {
>> "field": "screen_name",
>> "precision_threshold": 4
>>   }
>> }
>>   }
>> }
>>   }
>> }
>>
>> Result:
>>"aggregations": {
>>   "country_count": {
>>  "buckets": [
>> {
>>"key": "en",
>>"doc_count": 872906,
>>"distinct_users_approx": {
>>   "value": 145541
>>}
>> },
>> {
>>"key": "ja",
>>"doc_count": 581521,
>>"distinct_users_approx": {
>>   "value": 50824
>>}
>> },
>>
>> So, 307489 vs 145541 for english, and 103035 vs 50824 for japanese. Not
>> very accurate.
>>
>> 1) Am I doing the reference unique count distinct correctly?
>> 2) Is it supposed to be this inaccurate on this type of dataset?
>> 3) Is there any way to improve precision?
>>
>> -
>> Henrik
>>



Re: Accuracy on cardinality aggregate

2014-03-28 Thread Mark Harwood
I don't believe value_count is intended to be a unique count.



On Friday, March 28, 2014 7:17:47 AM UTC, Henrik Nordvik wrote:
>
> Hi,
> I'm trying out the new cardinality aggregation, and want to measure the 
> accuracy on my data. I'm using a dataset of a day of sample tweets (2.8m 
> tweets).
>
> I'm counting the number of unique usernames per language.
> To get my "reference" unique count I use this:
> GET /twitter-2014.03.26/_search
> {
>   "size": 0,
>   "aggs": {
> "country_count": {
>   "terms": {
> "field": "lang"
>   },
>   "aggs": {
>"unique_count" : { "value_count" : { "field" : "screen_name" } }
>   }
> }
>   }
> }
>
> Result:
>   "aggregations": {
>   "country_count": {
>  "buckets": [
> {
>"key": "en",
>"doc_count": 872906,
>"unique_count": {
>   "value": 307489
>}
> },
> {
>"key": "ja",
>"doc_count": 581521,
>"unique_count": {
>   "value": 103035
>}
> },
>
>
> To get the approximate count with cardinality:
> GET /twitter-2014.03.26/_search
> {
>   "size": 0,
>   "aggs": {
> "country_count": {
>   "terms": {
> "field": "lang"
>   },
>   "aggregations": {
> "distinct_users_approx": {
>   "cardinality": {
> "field": "screen_name",
> "precision_threshold": 4
>   }
> }
>   }
> }
>   }
> }
>
> Result:
>"aggregations": {
>   "country_count": {
>  "buckets": [
> {
>"key": "en",
>"doc_count": 872906,
>"distinct_users_approx": {
>   "value": 145541
>}
> },
> {
>"key": "ja",
>"doc_count": 581521,
>"distinct_users_approx": {
>   "value": 50824
>}
> },
>
> So, 307489 vs 145541 for english, and 103035 vs 50824 for japanese. Not 
> very accurate.
>
> 1) Am I doing the reference unique count distinct correctly?
> 2) Is it supposed to be this inaccurate on this type of dataset?
> 3) Is there any way to improve precision?
>
> -
> Henrik
>
