To follow up, I have a self-contained test suite for this problem at https://gist.github.com/thanodnl/8803745. It contains two files:
1. aggsbug.sh
2. aggsbug.json

The .json file contains ~1M newline-separated documents to load into the database; I was not able to create a curl request to load them directly into the index. The .sh file (https://gist.github.com/thanodnl/8803745/raw/aggsbug.sh) contains the instructions for recreating this behavior.

I have run these against the following versions:

1. 1.0.0.Beta2
2. 1.0.0.RC1
3. 1.0.0-SNAPSHOT as compiled from the git 1.0 branch on commit 0f8b41ffad9b5ecdfd543d7c73edcf404e6fc763

When run on 1.0.0.Beta2, it consistently gives the same output when I run the _search over and over again.
When run on 1.0.0.RC1, it gives me multiple different outcomes, comparable to the numbers I posted earlier in the thread.
When run on 1.0.0-SNAPSHOT, it behaves the same as on 1.0.0.RC1.

That it still worked on 1.0.0.Beta2 proves to me that this is a bug that got into RC1. I could not find any related ticket on the issues page of the GitHub repository.

Hopefully this is enough information to recreate the problem. The json file is quite big and may cause problems when you open the gist in a browser. Cloning the gist locally will work best:

    $ git clone https://gist.github.com/8803745.git

I do not really know how to move on from here. Do you want me to open an issue for this problem at github.com/elasticsearch/elasticsearch? It would be nice to fix this problem before the release of 1.0.0, since that is the first release containing the aggregations for analytics.

On Tuesday, February 4, 2014 12:31:10 PM UTC+1, Nils Dijk wrote:
> I've loaded the same dataset in ES 1.0.0.Beta2 with the same index
> configuration as in the topic start.
>
> However, now the numbers are consistent if I call the same aggregation
> multiple times in a row AND the numbers match the numbers of the facets.
> This leads me to the conclusion that something broke between Beta2 and RC1!
>
> I would like to test this on master, but I could not find any nightly
> builds of elasticsearch.
> Is there a location where they are stored or should I compile it myself?
>
> On Friday, January 31, 2014 6:43:07 PM UTC+1, Nils Dijk wrote:
>>
>> Hi Binh Ly,
>>
>> Thanks for the response.
>>
>> I'm aware that the numbers are not exact (hence the link to issue #1305
>> in my initial post), and have been advocating slightly incorrect numbers
>> with my colleagues and customers for some time already to prepare them for
>> the moment we provide analytics with ES. But what bothers me is that they
>> are *inconsistent*.
>>
>> If you look at my gist you see that I ran the same aggs 3 times right
>> after each other. If we just look at the top item we see the following
>> results:
>>
>>    1. { "key": "totaltrafficbos", "doc_count": 2880 }
>>    2. { "key": "totaltrafficbos", "doc_count": 2552 }
>>    3. { "key": "totaltrafficbos", "doc_count": 2179 }
>>
>> These results were taken within seconds, without any change to the number of
>> documents in the index. If I run them even more often, you see that it rotates
>> between a handful of numbers. Is this also behavior one would expect from
>> the aggs? And if so, why do the facets show the same number over and over
>> again?
>>
>> Anyway, I will try to work my way through the aggs code this weekend to get
>> a better grasp of what we can and cannot do with it.
>>
>> -- Nils
>>
>> On Friday, January 31, 2014 6:18:43 PM UTC+1, Binh Ly wrote:
>>>
>>> Nils,
>>>
>>> This is just the nature of splitting data around in shards. Actually the
>>> terms facet has the same limitations (i.e. it will also give "approximate
>>> counts"). Neither the terms facet nor the terms aggregation is better or
>>> worse than the other - they are both approximations (using different
>>> implementations). It is correct that if you put all your data in 1 shard,
>>> then all the counts are exact. If you need to shard, you can increase the
>>> "shard_size" parameter inside the terms aggregation to "improve accuracy".
>>> Play with that number until it suits your purposes, but the important thing
>>> is that they are just approximations (more so the more documents you have
>>> in the index), so just don't expect absolute numbers from them if you have
>>> more than 1 shard.
>>>
>>> {
>>>   "size": 0,
>>>   "aggs": {
>>>     "a": {
>>>       "terms": {
>>>         "field": "actor.displayName",
>>>         "shard_size": 10000
>>>       }
>>>     }
>>>   }
>>> }
>>>
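
As a side note on the approximation Binh Ly describes above, here is a toy Python model of how per-shard truncation could produce approximate yet repeatable term counts. It is only a sketch under assumptions, not Elasticsearch code: the function names (build_shards, shard_top_terms, merged_terms, exact_counts), the skewed term distribution, and the random document routing are all made up for illustration.

```python
import random
from collections import Counter


def build_shards(num_docs=20000, num_terms=500, num_shards=5, seed=42):
    """Spread single-term docs over shards; every parameter here is invented."""
    rng = random.Random(seed)
    terms = [f"term{i}" for i in range(num_terms)]
    weights = [1.0 / (i + 1) for i in range(num_terms)]  # a few terms dominate
    docs = rng.choices(terms, weights=weights, k=num_docs)
    shards = [[] for _ in range(num_shards)]
    for term in docs:
        shards[rng.randrange(num_shards)].append(term)
    return shards


def shard_top_terms(shard_docs, shard_size):
    """Each shard reports only its local top `shard_size` terms."""
    return Counter(shard_docs).most_common(shard_size)


def merged_terms(shards, size, shard_size):
    """Sum the truncated per-shard lists and keep the overall top `size`."""
    merged = Counter()
    for shard_docs in shards:
        for term, count in shard_top_terms(shard_docs, shard_size):
            merged[term] += count
    return merged.most_common(size)


def exact_counts(shards):
    """Ground truth: count every term over all shards without truncation."""
    return Counter(term for shard_docs in shards for term in shard_docs)


if __name__ == "__main__":
    shards = build_shards()
    exact = exact_counts(shards)
    for shard_size in (10, 50, 500):
        top = merged_terms(shards, size=10, shard_size=shard_size)
        worst = max(exact[term] - count for term, count in top)
        print(f"shard_size={shard_size}: largest undercount in top 10 = {worst}")
```

In this model the merged counts can only undercount, they match the exact counts once shard_size covers every term on each shard, and repeated runs over the same data always return the same numbers. That fits the point made earlier in the thread: approximate counts are expected with more than one shard, but counts that change between identical requests are not explained by this kind of truncation alone.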