Re: aggregation giving inconsistent results

2014-10-31 Thread Jay Hilden
Thanks Adrien, your link was very helpful in understanding why I was
getting these results.  After some experimentation on our data, I'm going
to set shard_size to 20 times the requested size.  So in my testing, when I
want the top 5 results for a very flat term distribution, I set shard_size
to 100 (5*20), and that is giving me accurate results.

Thanks again!
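
For reference, the rule of thumb above (shard_size = size * 20) could be
expressed as a small Python helper that builds the request body.  This is
only a sketch: the function name terms_agg_body and the multiplier argument
are made up for illustration, the 20x factor is just what worked on this
data set, and the date-range query from the original request is left out
for brevity.

```python
import json


def terms_agg_body(field, size, multiplier=20):
    """Build a terms aggregation request body.

    shard_size is set to size * multiplier so that each shard returns
    more candidate terms than the final result needs.
    """
    return {
        "size": 0,  # no hits, aggregation results only
        "aggs": {
            "users": {
                "terms": {
                    "field": field,
                    "size": size,
                    "shard_size": size * multiplier,
                }
            }
        },
    }


body = terms_agg_body("authInput.userName.userNameNotAnalyzed", size=5)
print(json.dumps(body, indent=2))
```

A larger shard_size makes every shard do more work and send more data to
the coordinating node, so the multiplier is a trade-off between accuracy
and cost.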

On Fri, Oct 31, 2014 at 3:44 AM, Adrien Grand <
adrien.gr...@elasticsearch.com> wrote:

> This is unfortunately a known limitation of the terms aggregation. Note,
> however, that Elasticsearch 1.4 (only a beta version is available today, but
> the GA release should be available within a couple of weeks) improves some
> heuristics, which gives better accuracy by default, and also reports an
> error bound on the document counts that are returned.
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts

Re: aggregation giving inconsistent results

2014-10-31 Thread Adrien Grand
This is unfortunately a known limitation of the terms aggregation. Note,
however, that Elasticsearch 1.4 (only a beta version is available today, but
the GA release should be available within a couple of weeks) improves some
heuristics, which gives better accuracy by default, and also reports an
error bound on the document counts that are returned.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts
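
To illustrate why the counts can be off, here is a toy Python simulation of
how per-shard top-N lists get merged.  It is a simplified model, not
Elasticsearch's actual implementation: the shard contents are invented, and
it assumes each shard returns exactly shard_size terms (defaulting to size).

```python
from collections import Counter


def terms_agg(shards, size, shard_size=None):
    """Merge per-shard top terms the way a coordinating node would.

    Each shard reports only its own top `shard_size` terms, so counts for
    terms that miss the cut on some shard are silently dropped.
    """
    shard_size = shard_size or size
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical per-shard term counts; the true totals are c=27, a=22, b=19.
shards = [
    {"a": 10, "b": 9, "c": 8},
    {"a": 10, "c": 9, "b": 1},
    {"c": 10, "b": 9, "a": 2},
]

print(terms_agg(shards, size=1))                # [('a', 20)]
print(terms_agg(shards, size=1, shard_size=3))  # [('c', 27)]
```

With shard_size equal to size, both the winning term and its count are
wrong, because "c" is never the top term on the first two shards.  Asking
each shard for more terms recovers the exact answer in this example.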




-- 
Adrien Grand


aggregation giving inconsistent results

2014-10-30 Thread Jay Hilden
I'm running a terms aggregation and asking for the top 5 results.  When I run
the exact same aggregation asking for the top 50, I get totally different
results.  I expected that the top 5 would remain the same and an additional 45
would be added to the list.  That is not what's happening.

Note: I'm aggregating on the not_analyzed sub-field of the multi-field
authInput.userName; I'm not sure whether that makes a difference.

*Here is my query:*

GET prodstarbucks/authEvent/_search
{
  "size": 0,
  "aggs": {
    "users": {
      "terms": {
        "field": "authInput.userName.userNameNotAnalyzed",
        "size": 5
      }
    }
  },
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "authResult.authEventDate": {
                  "gte": "2014-10-01T00:00:00.000",
                  "lte": "2014-10-31T00:00:00.000"
                }
              }
            }
          ]
        }
      }
    }
  }
}

*RESULT:*
{
   "took": 2171,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1090455,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "users": {
         "buckets": [
            {
               "key": "3D64E4FD-6D25-4E77-966E-A0E059CFD31E",
               "doc_count": 91
            },
            {
               "key": "3982EC96-DB4C-4A22-AC64-2CFC09D52579",
               "doc_count": 68
            },
            {
               "key": "674E6691-8A46-4D34-BB31-B78780969311",
               "doc_count": 24
            },
            {
               "key": "64449480-77AC-4D64-B79D-DDB545BEE472",
               "doc_count": 23
            },
            {
               "key": "{7CB63FEE-709A-4AD5-AA16-2CFE3282FEE8}",
               "doc_count": 23
            }
         ]
      }
   }
}

If I change the aggregation size to be 50, these are my top 5 results:
{
   "took": 2256,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1090501,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "users": {
         "buckets": [
            {
               "key": "3D64E4FD-6D25-4E77-966E-A0E059CFD31E",
               "doc_count": 109
            },
            {
               "key": "3982EC96-DB4C-4A22-AC64-2CFC09D52579",
               "doc_count": 84
            },
            {
               "key": "F77E8291-1640-4C3F-8A1A-D6D955AB940A",
               "doc_count": 59
            },
            {
               "key": "6AC1ED48-8F91-400B-9353-172BB6E1823B",
               "doc_count": 53
            },
            {
               "key": "52CDF454-92C2-4C32-91F6-AF4F08C8F908",
               "doc_count": 52
            },
            ...


The doc_counts are all different.  Can someone help explain this to me and
let me know how I might get the correct doc_count even when only asking for
the top 5 results?

Thank you!

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/3e7e5a69-59ee-4472-abb5-598258f97341%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.