I am seeing some erroneous behavior in my ES cluster when performing 
aggregations.  Originally, I thought this was specific to a histogram 
aggregation, as that is where the error first appeared (in a Kibana 3 
graph - see my post 
https://groups.google.com/forum/#!topic/elasticsearch/iY-lKjtW7PM for 
reference), but I have since been able to reproduce the problem with a 
simple max aggregation.  The details are as follows:

ES Version: 1.4.4
Topology: 5 nodes, 5 shards per index, 2 replicas
OS: Red Hat Linux

To reproduce the issue, I execute the following query against the cluster 
(a sample curl invocation is shown after the query body):

{
  "query": {
    "term": {
      "metric": "used"
    }
  },
  "aggs": {
    "max_val": {
      "max": {
        "field": "metric_value"
      }
    }
  }
}
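
For completeness, this is roughly how I invoke it with curl; the host and 
index name below are placeholders for one of our date-based metric indices:

curl -XPOST 'http://localhost:9200/metrics-2015.03.01/_search' -d '
{
  "query": { "term": { "metric": "used" } },
  "aggs": { "max_val": { "max": { "field": "metric_value" } } }
}'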

Upon executing this query multiple times, I get different responses.  One 
time I get the expected result:
...
"took": 13,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 11712,
"max_score": 9.361205,
...
"aggregations": { "max_val": { "value": 18096380}}

whereas on another request with the same query I get the following bad 
response:

"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 11712,
"max_score": 9.361205,
...
"aggregations": { "max_val": { "value": 4697741490703565000}}



Some possibly relevant observations:
1.  In my first set of tests, I consistently got the correct result for 
the first 2 requests and the bad result on the 3rd request (with no one 
else executing this query at that point in time).
2.  Flushing the cache did not correct the issue.
3.  I reduced the number of replicas to 0 (the settings calls I used are 
shown after this list) and then consistently got the same result, which 
happened to be the correct one.
4.  After increasing the replica count back to 2 and waiting until ES 
reported that replication was complete, I tried the same experiment.  
This time, the 1st request returned the correct result and the next 2 
requests returned incorrect results.  In this case the incorrect results 
were not identical, but both were huge and of the same order of magnitude.
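
For reference, the replica changes in observations 3 and 4 were made via 
the index settings API, along these lines (host and index name are 
placeholders, as before):

# Drop to 0 replicas (primaries only):
curl -XPUT 'http://localhost:9200/metrics-2015.03.01/_settings' -d '
{ "index": { "number_of_replicas": 0 } }'

# Restore 2 replicas and let ES re-replicate:
curl -XPUT 'http://localhost:9200/metrics-2015.03.01/_settings' -d '
{ "index": { "number_of_replicas": 2 } }'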


Other info:
- The size of the index was about 3.3 GB with ~50M documents in it
- This is one of many date-based indices (i.e. similar to the logstash 
index setup), but it is the only one in this installation that exhibited 
the issue.  I believe we saw something similar in a UAT environment as 
well, where 1 or 2 of the indices behaved in this same odd way
- ES reported the entire cluster as green


It seems that some shard(s)/replica(s) were corrupted during replication 
and that we were being routed to the bad copy on every 3rd hit.  (Is the 
every-3rd-request pattern somehow correlated with the number of copies, 
i.e. 1 primary + 2 replicas?)
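
One thing I have not tried yet, which might help isolate a bad copy: 
pinning the search to particular shard copies with the preference 
parameter, to see whether one copy consistently returns the huge value.  
If I understand the docs correctly, something like this should work 
(host/index placeholders as before):

# Query only the primary copies:
curl -XGET 'http://localhost:9200/metrics-2015.03.01/_search?preference=_primary' -d '
{
  "query": { "term": { "metric": "used" } },
  "aggs": { "max_val": { "max": { "field": "metric_value" } } }
}'

# An arbitrary preference string sticks to the same copies across calls,
# so repeating with different strings should eventually exercise each replica:
curl -XGET 'http://localhost:9200/metrics-2015.03.01/_search?preference=debug-session-1' -d '
{
  "query": { "term": { "metric": "used" } },
  "aggs": { "max_val": { "max": { "field": "metric_value" } } }
}'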

So, my questions are:

1. Has anyone seen this type of behavior before?  
2. Can it somehow be data dependent?
3. Is there any way to figure out what happened/what is happening?
4. Why does ES report the cluster state as green?
5. How can I debug this?
6. How can I prevent/correct this?


Any and all help/pointers would be greatly appreciated.

Thanks in advance,
MC
