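I don't believe value_count is intended to be a unique count. It counts the 
values extracted from the matching documents, duplicates included, so it 
isn't a valid reference to compare the cardinality aggregation against.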
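
To illustrate the difference, here is a minimal sketch in plain Python (not Elasticsearch code). The sample list and the two helper functions are made up for illustration and only mimic what each aggregation is meant to measure:

```python
# Hypothetical sample: one screen_name value per tweet.
screen_names = ["alice", "bob", "alice", "carol", "bob", "alice"]


def value_count(values):
    """Count every value, duplicates included."""
    return len(values)


def distinct_count(values):
    """Count each distinct value once (the quantity cardinality approximates)."""
    return len(set(values))


print(value_count(screen_names))     # 6
print(distinct_count(screen_names))  # 3
```

The two numbers only agree when no value repeats, so a gap between value_count and cardinality doesn't by itself say anything about the accuracy of cardinality.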


On Friday, March 28, 2014 7:17:47 AM UTC, Henrik Nordvik wrote:
>
> Hi,
> I'm trying out the new cardinality aggregation, and want to measure the 
> accuracy on my data. I'm using a dataset of a day of sample tweets (2.8m 
> tweets).
>
> I'm counting the number of unique usernames per language.
> To get my "reference" unique count I use this:
> GET /twitter-2014.03.26/_search
> {
>   "size": 0,
>   "aggs": {
>     "country_count": {
>       "terms": {
>         "field": "lang"
>       },
>       "aggs": {
>         "unique_count": { "value_count": { "field": "screen_name" } }
>       }
>     }
>   }
> }
>
> Result:
>   "aggregations": {
>       "country_count": {
>          "buckets": [
>             {
>                "key": "en",
>                "doc_count": 872906,
>                "unique_count": {
>                   "value": 307489
>                }
>             },
>             {
>                "key": "ja",
>                "doc_count": 581521,
>                "unique_count": {
>                   "value": 103035
>                }
>             },
>
>
> To get the approximate count with cardinality:
> GET /twitter-2014.03.26/_search
> {
>   "size": 0,
>   "aggs": {
>     "country_count": {
>       "terms": {
>         "field": "lang"
>       },
>       "aggregations": {
>         "distinct_users_approx": {
>           "cardinality": {
>             "field": "screen_name",
>             "precision_threshold": 40000
>           }
>         }
>       }
>     }
>   }
> }
>
> Result:
>    "aggregations": {
>       "country_count": {
>          "buckets": [
>             {
>                "key": "en",
>                "doc_count": 872906,
>                "distinct_users_approx": {
>                   "value": 145541
>                }
>             },
>             {
>                "key": "ja",
>                "doc_count": 581521,
>                "distinct_users_approx": {
>                   "value": 50824
>                }
>             },
>
> So, 307489 vs 145541 for English, and 103035 vs 50824 for Japanese. Not 
> very accurate.
>
> 1) Am I computing the reference distinct count correctly?
> 2) Is it supposed to be this inaccurate on this type of dataset?
> 3) Is there any way to improve precision?
>
> -
> Henrik
>
