To follow up,

I have a self-contained test suite at https://gist.github.com/thanodnl/8803745 for 
this problem. It contains two files:

   1. aggsbug.sh
   2. aggsbug.json

The .json file contains ~1M newline-separated documents to load into the 
database. I was not able to create a curl request to load them directly 
into the index.
The .sh file (https://gist.github.com/thanodnl/8803745/raw/aggsbug.sh) 
contains the instructions for recreating this behavior.
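
For anyone who wants to load the .json file over HTTP anyway, here is a 
minimal sketch of turning newline-separated documents into a bulk request 
body. It is not what aggsbug.sh does; the function name to_bulk_body is 
made up for illustration, and the index and type would have to be supplied 
in the request URL.

```python
import json


def to_bulk_body(ndjson_lines):
    """Put an empty index action line in front of every document line.

    The bulk endpoint expects pairs of lines: an action line followed by
    the document source. With an empty action, the index and type are
    taken from the request URL.
    """
    out = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the input
        json.loads(line)  # validate early: raises ValueError on bad JSON
        out.append('{"index":{}}')
        out.append(line)
    # A bulk body has to end with a newline.
    return "\n".join(out) + "\n" if out else ""


if __name__ == "__main__":
    sample = ['{"actor": {"displayName": "totaltrafficbos"}}']
    print(to_bulk_body(sample), end="")
```

With ~1M documents the body would need to be split into smaller batches 
and each batch posted separately, for example with curl --data-binary so 
that the newlines are preserved.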

I have run these against the following versions:

   1. 1.0.0.Beta2
   2. 1.0.0.RC1
   3. 1.0.0-SNAPSHOT as compiled from the git 1.0 branch on commit 
   0f8b41ffad9b5ecdfd543d7c73edcf404e6fc763

When run on 1.0.0.Beta2 it gives the same output consistently when I run 
the _search over and over again.
When run on 1.0.0.RC1 it gives me multiple different outcomes, 
comparable to the numbers I posted earlier in the thread.
When run on 1.0.0-SNAPSHOT it behaves the same as on 1.0.0.RC1.
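
The check itself is simple enough to sketch: collect the responses of 
several identical _search calls and count how many distinct top buckets 
come back. The helper names below are made up, the aggregation name "a" 
is taken from the example request quoted further down, and the response 
shape (aggregations -> name -> buckets) is what I assume the terms 
aggregation returns. The sample counts are the three I posted earlier.

```python
def top_bucket(response, agg_name):
    """Return (key, doc_count) of the first bucket, or None if there is none."""
    buckets = response["aggregations"][agg_name]["buckets"]
    if not buckets:
        return None
    return (buckets[0]["key"], buckets[0]["doc_count"])


def distinct_outcomes(responses, agg_name):
    """Set of distinct top buckets seen across repeated identical searches."""
    return {top_bucket(r, agg_name) for r in responses}


def fake_response(count):
    # Stand-in for a parsed _search response; only the fields used above.
    return {
        "aggregations": {
            "a": {"buckets": [{"key": "totaltrafficbos", "doc_count": count}]}
        }
    }


if __name__ == "__main__":
    runs = [fake_response(n) for n in (2880, 2552, 2179)]
    outcomes = distinct_outcomes(runs, "a")
    # More than one distinct outcome means the runs disagree.
    print("consistent" if len(outcomes) == 1 else "inconsistent", sorted(outcomes))
```

On an index that is not changing, I would expect exactly one distinct 
outcome, which is what I see on Beta2.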

The fact that it still works on 1.0.0.Beta2 tells me that this is a bug 
that was introduced in RC1. I could not find any related ticket on the 
issues page of the GitHub repository. Hopefully this is enough information 
to recreate the problem.

The JSON file is quite big and could cause problems when you open the gist 
in a browser. Cloning the gist locally will work best:
$ git clone https://gist.github.com/8803745.git

I do not really know how to move on from here. Do you want me to open an 
issue for this problem at github.com/elasticsearch/elasticsearch? It would 
be nice to fix this problem before the release of 1.0.0, since that is the 
first release containing the aggregations for analytics.

On Tuesday, February 4, 2014 12:31:10 PM UTC+1, Nils Dijk wrote:

> I've loaded the same dataset in ES1.0.0.Beta2 with the same index 
> configuration as in the topic start.
>
> However, now the numbers are consistent if I call the same aggregation 
> multiple times in a row AND the numbers match the numbers of the facets. 
> This leads me to the conclusion that something broke between Beta2 and RC1!
>
> I would like to test this on master, but I could not find any nightly 
> builds of elasticsearch. Is there a location where they are stored or 
> should I compile it myself?
>
> On Friday, January 31, 2014 6:43:07 PM UTC+1, Nils Dijk wrote:
>>
>> Hi Binh Ly,
>>
>> Thanks for the response.
>>
>> I'm aware that the numbers are not exact (hence the link to issue #1305 
>> in my initial post), and have been advocating slightly incorrect numbers 
>> with my colleagues and customers for some time already to prepare them for 
>> the moment we provide analytics with ES. But what bothers me is that they 
>> are *inconsistent*.
>>
>> If you look at my gist you see that I ran the same aggs 3 times right 
>> after each other. If we just look at the top item we see the following 
>> results:
>>
>>    1. { "key": "totaltrafficbos", "doc_count": 2880 }
>>    2. { "key": "totaltrafficbos", "doc_count": 2552 }
>>    3. { "key": "totaltrafficbos", "doc_count": 2179 }
>>    
>> These results were taken within seconds of each other, without any change 
>> to the number of documents in the index. If I run them even more often, 
>> you see that it rotates between a handful of numbers. Is this also 
>> behavior one would expect from the aggs? And if so, why do the facets show 
>> the same number over and over again?
>>
>> Anyway, I will try to work my way through the aggs code this weekend to get 
>> a better grasp of what we could do with it, and what not.
>>
>> -- Nils
>>
>> On Friday, January 31, 2014 6:18:43 PM UTC+1, Binh Ly wrote:
>>>
>>> Nils,
>>>
>>> This is just the nature of splitting data around in shards. Actually the 
>>> terms facet has the same limitations (i.e. it will also give "approximate 
>>> counts"). Neither the terms facet nor the terms aggregation is better or 
>>> worse than the other - they are both approximations (using different 
>>> implementations). It is correct that if you put all your data in 1 shard, 
>>> then all the counts are exact. If you need to shard, you can increase the 
>>> "shard_size" parameter inside the terms aggregation to "improve accuracy". 
>>> Play with that number until it suits your purposes, but the important thing 
>>> is that they are just approximations, more so the more documents you have 
>>> in the index, so just don't expect absolute numbers from them if you have 
>>> more than 1 shard.
>>>
>>> {
>>>   "size": 0,
>>>   "aggs": {
>>>     "a": {
>>>       "terms": {
>>>         "field": "actor.displayName",
>>>         "shard_size": 10000
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>
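
To make the distinction between approximate and inconsistent concrete, 
here is a toy model of the shard_size effect described above. It is only a 
sketch and not how Elasticsearch implements the terms aggregation: each 
"shard" is a plain list of terms, each shard reports its top shard_size 
terms, and the reported counts are summed. The function name and the 
sample data are made up.

```python
from collections import Counter


def sharded_terms(shards, size, shard_size):
    """Merge per-shard top-`shard_size` term counts, return the top `size`."""
    merged = Counter()
    for shard in shards:
        # Each shard only reports its own top `shard_size` terms.
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


if __name__ == "__main__":
    shards = [
        ["a"] * 5 + ["b"] * 4 + ["c"] * 1,
        ["c"] * 5 + ["b"] * 4 + ["a"] * 1,
    ]
    # True totals are a=6, b=8, c=6.
    print(sharded_terms(shards, size=3, shard_size=1))  # "b" is cut on every shard
    print(sharded_terms(shards, size=3, shard_size=3))  # every term is reported
```

In this model a small shard_size gives counts that are too low, and can 
drop a term completely, but the same input gives the same answer on every 
call. That is the kind of error I was prepared for; it does not by itself 
account for counts that change between identical requests.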

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/fb421a29-8923-4188-9363-03682fec71ab%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.