Lots of things could be the source of problems here.  Maybe you can tune 
the JVM params.  We don't know what you are using or what your GC activity 
looks like.  Can you share GC metrics graphs?  If you don't have any GC 
monitoring, you can use SPM <http://sematext.com/spm/>.  Why do you have 5 
shards for all indices?  Some seem small and shouldn't need to be sharded 
so much.  Why do you have 3 replicas and not, say, just 2? (we don't know 
your query rates).
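
If you want a quick look before setting up anything heavier, something like 
this with the Python client you're already using will print heap usage and 
old-generation GC activity per node (the host name is a placeholder; this is 
just a sketch against the standard nodes stats API):

    from elasticsearch import Elasticsearch

    # Any node in the cluster will do; the host name is a placeholder.
    es = Elasticsearch(['es-node-1:9200'])

    # Per-node JVM stats: heap usage and old-generation GC counters.
    stats = es.nodes.stats(metric='jvm')
    for node_id, node in stats['nodes'].items():
        jvm = node['jvm']
        old_gc = jvm['gc']['collectors']['old']
        print('{0}: heap {1}%, old GC count {2}, old GC time {3} ms'.format(
            node['name'],
            jvm['mem']['heap_used_percent'],
            old_gc['collection_count'],
            old_gc['collection_time_in_millis']))

If the old-generation counts and times climb steadily while you index, that 
points at sustained heap pressure rather than a one-off failure.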

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Monday, July 21, 2014 12:38:49 PM UTC-4, kmoore.cce wrote:
>
> I have had some issues recently as I've expanded my ES cluster, where a 
> single node failure causes basically all other index/search operations to 
> time out and fail.
>
> I am currently running elasticsearch v1.2.1 and primarily interface with 
> the indices using the elasticsearch python module.
>
> My cluster is 20 nodes, each an m1.large EC2 instance. I currently have 
> ~18 indices, each with 5 shards and 3 replicas. The average index is ~20 GB 
> and ~10 million documents (the smallest is ~100K documents / 300 MB, the 
> largest ~40 million / 35 GB).
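>
> (For reference, assuming those counts are exact, that works out to 18 indices 
> x 5 shards x 4 copies each (1 primary + 3 replicas) = 360 shards, or roughly 
> 18 shards per m1.large node.)
>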
> I run each node with ES_MAX_SIZE=4g and ES_MIN_SIZE=512m. There are no 
> other services running on the Elasticsearch nodes except ssh. I use zen 
> unicast discovery with a fixed list of nodes. I have tried to enable 
> 'bootstrap.mlockall', but the ulimit settings do not seem to take effect, 
> and I keep getting 'Unable to lock JVM memory (ENOMEM)' errors when starting 
> a node (note: I didn't see this log message when running 0.90.7).
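>
> For what it's worth, a quick way to check from Python whether mlockall 
> actually took effect would be something like this (the host name is a 
> placeholder; it just reads the standard nodes info API):
>
>     from elasticsearch import Elasticsearch
>
>     es = Elasticsearch(['es-node-1:9200'])
>
>     # The process section of nodes info reports whether the JVM managed
>     # to lock its memory.
>     info = es.nodes.info(metric='process')
>     for node_id, node in info['nodes'].items():
>         mlocked = node['process']['mlockall']
>         print('{0}: mlockall={1}'.format(node['name'], mlocked))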
>
> I have a fairly constant stream of new or updated documents being ingested 
> (I don't actually update; I reindex whenever a new document arrives with the 
> same id as an existing one), and a number of users querying the data 
> regularly - most queries are set queries through the Python API.
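>
> To make that concrete, the reindex-on-same-id pattern is essentially this 
> (index, type, and field names here are placeholders):
>
>     from elasticsearch import Elasticsearch
>
>     es = Elasticsearch(['es-node-1:9200'])
>
>     doc = {'id': 'abc-123', 'title': 'example', 'body': '...'}
>
>     # Indexing with an explicit id overwrites any existing document that
>     # has the same id, which is how "updates" are handled here.
>     es.index(index='documents', doc_type='document', id=doc['id'], body=doc)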
>
> The issue I have now is that while data is being ingested/indexed, I hit 
> Java heap out-of-memory errors. I think this is related to garbage 
> collection, as that seems to be the last activity in the logs nearly every 
> time this occurs. I have tried raising the heap max to 6g, and that seems 
> to help, but I am not sure it solves the issue. On top of that, when the 
> out-of-memory error occurs it seems to cause the other nodes to stop working 
> effectively, with timeout errors on both indexing and searching.
>
> My question is: what is the best way to tolerate a node failing for this 
> reason? I would obviously like to solve the underlying problem as well, but 
> I would also like the cluster to survive a node crashing for whatever reason 
> (whether because of me or because EC2 took it away). Shouldn't failover to 
> the replicas cover for the missing node? I understand the cluster state 
> would be yellow at that point, but I should still be able to index and 
> search data on the remaining nodes, correct?
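>
> For context, what I'd expect to see during a node outage is just what the 
> standard cluster health call reports, something like:
>
>     from elasticsearch import Elasticsearch
>
>     es = Elasticsearch(['es-node-1:9200'])
>
>     # With replicas configured, status should drop to 'yellow' (not 'red')
>     # while a node is gone, since every shard still has a live copy.
>     health = es.cluster.health()
>     print('status={0} unassigned_shards={1}'.format(
>         health['status'], health['unassigned_shards']))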
>
> Are there configuration changes I can make to better support the cluster 
> and identify or solve the underlying issue? 
>
> Any help is appreciated. I understand I have a lot to learn about 
> Elasticsearch, but I am hoping I can add some stability/resiliency to my 
> cluster.
>
> Thanks in advance,
> -Kevin
>
