Hello,

Our team recently upgraded from ES 1.1.2 to 1.3.2, and we are happy with the
improvements ... except for one perplexing situation.

We are running on Azure worker roles with Oracle Java 1.8u11 and using the
G1 GC.  It is possible this is due to G1, but please consider all of the
data below before you pull out a pat response on G1.

Our cluster has 18 nodes, 3 of which are dedicated masters.  We have three
indexes, each with 5 shards and one replica.  The primary index is about
30 GB total (5.9 GB per shard, and the shards are all the same size).  The
main index has five types with about 10 fields each, a mix of strings,
dates, bools and longs.  None of the strings are analyzed.

All of the 18 nodes are client nodes and Azure is set up to round robin 
requests.  We have considered creating dedicated client nodes, but haven't 
done so yet.

The query I have been using is a combination of a non-trivial filter, a 
terms aggregation and two sum aggregations nested beneath the terms 
aggregation:

{ "query": { "filtered": { "filter": { "bool": { … } } } },
    "aggs": { "name1": { "terms": { "field": "stringfield1" },
            "aggs": { "sum1": { "sum": { "field": "longfield1" } },
                "sum2": { "sum": { "field": "longfield2" } } } } } }

I have run the tests on the cluster when it was lightly loaded (some
indexing plus lightweight metrics queries) and when there was no load at
all.  I’ll be the first to admit I could be even more systematic, but the
results I have are consistent enough, and hard enough to explain, that I
wanted to write to this community.

The primary test uses a filter which always results in an empty set.  The
bool filter contains two must term clauses, one must range clause and three
must_not term clauses.  Since I only care about the aggregation results,
this is a search_type=count query.
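
In case it helps anyone reproduce this, here is a minimal Python sketch of how the two request bodies (with and without the aggregations) can be built.  The field names and values inside the bool filter are placeholders I made up for illustration; they are not our real filter, and only the clause counts match what I described above.

```python
import json


def build_body(with_aggs=True):
    """Build the search body; filter fields and values are placeholders."""
    bool_filter = {
        # Two term clauses and one range clause under "must".
        "must": [
            {"term": {"stringfield2": "a"}},
            {"term": {"stringfield3": "b"}},
            {"range": {"datefield1": {"gte": "2014-01-01", "lt": "2014-01-02"}}},
        ],
        # Three term clauses under "must_not".
        "must_not": [
            {"term": {"stringfield4": "x"}},
            {"term": {"stringfield5": "y"}},
            {"term": {"boolfield1": True}},
        ],
    }
    body = {"query": {"filtered": {"filter": {"bool": bool_filter}}}}
    if with_aggs:
        # Terms aggregation with two nested sum aggregations.
        body["aggs"] = {
            "name1": {
                "terms": {"field": "stringfield1"},
                "aggs": {
                    "sum1": {"sum": {"field": "longfield1"}},
                    "sum2": {"sum": {"field": "longfield2"}},
                },
            }
        }
    return body


if __name__ == "__main__":
    print(json.dumps(build_body(), indent=2))
```

Calling build_body(with_aggs=False) gives the filter-only case, so the two tests differ only in the "aggs" section.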

If I run the query/filter without the aggregations, the time taken 
(results.took from ES) is ~0 (sometimes as high as 15ms).   That makes 
sense.

The case that doesn’t make sense is when I run the same filter on the same
cluster under the same conditions, this time WITH the aggregations, and get
anywhere from 200ms to 40000ms.  Yes, a factor of 200x.  I could believe
200ms to account for some overhead of the aggregations machinery, but
40000ms?  And there is no pattern that I can see as to when 200ms is
returned vs. 40000ms.

Given that Azure round robins the queries, I can imagine that depending on
which nodes are involved, the query might take more or less time.  In
fact, I would expect some variation.
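
One way to take the round robin out of the picture would be to send the same request to each node directly and compare the took values.  The following is only a sketch under assumptions: the host names and index name are hypothetical, port 9200 is assumed to be the HTTP port, and I have not run this against our cluster.  Note that it only controls which node coordinates the request; the shards are still searched wherever they live, so a slow data node can show up in every result.

```python
import json
import urllib.request


def post_count_search(host, index, body, timeout=60):
    """POST a search_type=count request to one node; return the parsed reply."""
    url = "http://%s:9200/%s/_search?search_type=count" % (host, index)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def took_per_node(hosts, index, body, runs=3, send=post_count_search):
    """Query each node directly and return {host: [took_ms, ...]}."""
    results = {}
    for host in hosts:
        results[host] = [send(host, index, body)["took"] for _ in range(runs)]
    return results


def slowest_first(results):
    """Order hosts by their worst observed took value, slowest first."""
    return sorted(results, key=lambda h: max(results[h]), reverse=True)
```

Usage would look like took_per_node(["node01", "node02"], "mainindex", body), where the node and index names are placeholders.  The send parameter exists only so the timing loop can be exercised with a fake sender instead of a live cluster.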

The other piece of data is that, while trying to debug this, I restarted ES
on some of the nodes.  By the time I had restarted the third node, the
query/filter + all aggregations case returned 200ms consistently.

My question is how it is possible for a filter that matches nothing, plus
aggregations, to take 40000ms.  In case it matters, I also tried the same
filter with only the terms aggregation (not the sums); the result was in
the 3500-4000ms range.

Hopefully this makes sense to someone.  I’m pulling my hair out and my 
colleagues on our internal ES alias are stumped as well.

Thanks for any help,

Craig.

 
