I am seeing some erroneous behavior in my ES cluster when performing
aggregations. Originally I thought it was specific to histograms, since
that is where the error first appeared (in a K3 graph - see my post
https://groups.google.com/forum/#!topic/elasticsearch/iY-lKjtW7PM for
reference), but I have since been able to reproduce the problem with a
simple max aggregation. The details are as follows:
ES version: 1.4.4
Topology: 5 nodes, 5 shards per index, 2 replicas
OS: Red Hat Linux
To reproduce the issue, I execute the following query against the cluster:
{
  "query": {
    "term": {
      "metric": "used"
    }
  },
  "aggs": {
    "max_val": {
      "max": {
        "field": "metric_value"
      }
    }
  }
}
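For reference, here is a sketch of how the query can be driven from Python (the host and index name below are placeholders, not our real ones, and the actual HTTP call is left as a comment):

```python
import json

# Placeholders -- substitute your own host and date-based index name.
host = "http://localhost:9200"
index = "metrics-2015.02.27"

# The same term query + max aggregation shown above.
query = {
    "query": {"term": {"metric": "used"}},
    "aggs": {"max_val": {"max": {"field": "metric_value"}}},
}

url = "%s/%s/_search" % (host, index)
body = json.dumps(query)

# Issue the same search repeatedly and compare the aggregation values, e.g.:
#   curl -XGET "<url>" -d '<body>'
# With 2 replicas, identical requests are spread across the 3 copies of
# each shard, which is why a bad copy only shows up on some requests.
```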
Upon executing this query multiple times, I get different responses. One
time I get the expected result:
...
"took": 13,
"timed_out": false,
"_shards": {
  "total": 5,
  "successful": 5,
  "failed": 0
},
"hits": {
  "total": 11712,
  "max_score": 9.361205,
...
"aggregations": { "max_val": { "value": 18096380 } }
whereas on another request with the same query I get the following bad
response:
...
"took": 8,
"timed_out": false,
"_shards": {
  "total": 5,
  "successful": 5,
  "failed": 0
},
"hits": {
  "total": 11712,
  "max_score": 9.361205,
...
"aggregations": { "max_val": { "value": 4697741490703565000 } }
Some possibly relevant observations:
1. In my first set of tests, I was consistently getting the correct
results for the first 2 requests and the bad result on the 3rd request
(with no one else executing this query at that point in time)
2. Flushing the cache did not correct the issue
3. I reduced the number of replicas to 0 and was consistently getting the
same result (which happened to be the correct one)
4. After increasing the replica count back to 2 and waiting until ES
reported that the replication was complete, I tried the same experiment.
This time, the 1st request retrieved the correct result and the next 2
requests retrieved incorrect results. In this case the incorrect results
were not the same but were both huge and of the same order of magnitude.
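For reference, the replica changes in (3) and (4) used the standard index settings API; a minimal sketch (index name is a placeholder):

```python
import json

# Placeholder index name; substitute the affected date-based index.
index = "metrics-2015.02.27"

# Drop to 0 replicas (only primaries remain -> results were consistent),
# then back to 2 and wait for the cluster to report green again.
drop_replicas = json.dumps({"index": {"number_of_replicas": 0}})
restore_replicas = json.dumps({"index": {"number_of_replicas": 2}})

# Each body is PUT to http://<host>:9200/<index>/_settings, e.g.:
#   curl -XPUT "localhost:9200/metrics-2015.02.27/_settings" -d '<body>'
```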
Other info:
- The index is about 3.3 GB with ~50M documents in it
- This is one of many date-based indices (i.e. similar to the logstash
index setup), but it is the only one in this installation that exhibits
the issue. I believe we saw something similar in a UAT environment as
well, where 1 or 2 of the indices acted in this weird manner
- ES reported the entire cluster as green
It seems that one or more replica shards were corrupted during
replication, and that requests were being round-robined across the 3
copies of each shard (1 primary + 2 replicas), landing on a bad copy on
every 3rd hit. (Is this somehow correlated to the number of replicas?)
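One way to test that theory is to pin the same query at individual shard copies using the search API's documented `preference` parameter and compare the aggregation values; a sketch (host and index are placeholders):

```python
# Targets for the `preference` query parameter:
#   _primary    -> search primary copies only
#   _shards:n   -> restrict the search to shard n
# Running the identical max aggregation against each target and comparing
# the returned values should identify which shard copy returns garbage.
host = "http://localhost:9200"   # placeholder
index = "metrics-2015.02.27"     # placeholder
targets = ["_primary"] + ["_shards:%d" % n for n in range(5)]
urls = ["%s/%s/_search?preference=%s" % (host, index, t) for t in targets]
```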
So, my questions are:
1. Has anyone seen this type of behavior before?
2. Can it somehow be data dependent?
3. Is there any way to figure out what happened/what is happening?
4. Why does ES report the cluster state as green?
5. How can I debug this?
6. How can I prevent/correct this?
Any and all help/pointers would be greatly appreciated.
Thanks in advance,
MC
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/2461a3f0-aee4-45f7-9210-3ef3524b12c5%40googlegroups.com.