Hello,
Our team recently upgraded from ES 1.1.2 to 1.3.2 and we are happy with the improvements ... except for one perplexing situation.

We are running on Azure worker roles with Oracle Java 1.8u11 and using the G1 GC. It is possible this is due to G1, but please consider all of the data below before you pull out a pat response on G1.

Our cluster has 18 nodes, 3 of which are dedicated masters. We have three indexes, each with 5 shards and one replica. The primary index is about 30gb total (5.9gb per shard, and the shards are the same size). We have five types in the main index, each with about 10 fields: a mix of strings, dates, bools and longs. None of the strings are analyzed. All of the 18 nodes are client nodes and Azure is set up to round robin requests. We have considered creating dedicated client nodes, but haven't done so yet.

The query I have been using is a combination of a non-trivial filter, a terms aggregation and two sum aggregations nested beneath the terms aggregation:

    {
      "query": {
        "filtered": {
          "filter": {
            "bool": { … }
          }
        }
      },
      "aggs": {
        "name1": {
          "terms": { "field": "stringfield1" },
          "aggs": {
            "sum1": { "sum": { "field": "longfield1" } },
            "sum2": { "sum": { "field": "longfield2" } }
          }
        }
      }
    }

I have run the tests on the cluster when it was lightly loaded (some indexing plus lightweight metrics queries) and when there was no load. I’ll be the first to admit I could be more systematic, but the results I have are consistent enough, and hard enough to explain, that I wanted to write to this community.

The primary test uses a filter which always results in an empty set. The filter contains two must terms, one must range and three must_not terms. Since I only care about the aggregation results, this is a search_type=count query. If I run the query/filter without the aggregations, the time taken (results.took from ES) is ~0 (sometimes as high as 15ms). That makes sense.
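To make the comparison concrete, here is a minimal Python sketch of how the two request bodies differ. It only builds the bodies and summarizes timings; it does not talk to a cluster. The helper names (build_body, spread) and the placeholder filter are hypothetical and only for illustration; the real bool filter is not reproduced here.

```python
import copy
import json

# Aggregations as described in the post: a terms aggregation with two nested sums.
AGGS = {
    "name1": {
        "terms": {"field": "stringfield1"},
        "aggs": {
            "sum1": {"sum": {"field": "longfield1"}},
            "sum2": {"sum": {"field": "longfield2"}},
        },
    }
}


def build_body(bool_filter, with_aggs):
    """Return a search body for the given bool filter, optionally with the aggregations."""
    body = {"query": {"filtered": {"filter": {"bool": copy.deepcopy(bool_filter)}}}}
    if with_aggs:
        body["aggs"] = copy.deepcopy(AGGS)
    return body


def spread(took_ms):
    """Return (min, max, max/min ratio) for a list of `took` values in milliseconds."""
    lo, hi = min(took_ms), max(took_ms)
    return lo, hi, (hi / lo if lo else float("inf"))


# Hypothetical stand-in filter (assumption). The real filter has two must terms,
# one must range and three must_not terms, and always matches nothing.
placeholder_filter = {"must": [{"term": {"stringfield1": "no-such-value"}}]}

filter_only = build_body(placeholder_filter, with_aggs=False)
filter_and_aggs = build_body(placeholder_filter, with_aggs=True)

print(json.dumps(filter_and_aggs, indent=2))
print(spread([200, 40000]))  # (200, 40000, 200.0)
```

Both bodies would be sent as search_type=count requests, and the only difference between them is the "aggs" section, so any gap in the reported took values comes from the aggregation phase rather than from the filter.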
The case that doesn’t make sense is when I run the same filter on the same cluster under the same conditions, this time WITH the aggregations, and I get anywhere from 200ms to 40000ms. Yes, a factor of 200x. I could believe 200ms to account for some overhead of the aggregations machinery, but 40000ms? And there is no pattern that I can tell as to when 200ms is returned vs. 40000ms. Given that Azure round robins the queries, I can imagine that depending on which nodes are involved, the query might take more or less time. In fact, I would expect some variation.

The other piece of data is that in trying to debug this I restarted ES on some of the nodes. By the time I had restarted the third node, the query/filter + all aggregations case returned 200ms consistently.

My question is how it is possible for an empty filter + aggregations to result in a 40000ms time. I also tried the same filter with only the terms aggregation (not the sums); the result was in the 3500-4000ms range, in case that matters.

Hopefully this makes sense to someone. I’m pulling my hair out and my colleagues on our internal ES alias are stumped as well.

Thanks for any help,
Craig