Re: Handling node failure in ES cluster

2014-07-21 Thread Mark Walkom
Max and min memory should be set to the same value; mlockall is probably not
working because they differ, since it can't lock a sliding window of memory.
Try setting them equal and see if that helps.
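For example (illustrative values and file locations - adjust for your install):

    # environment for the ES process, e.g. /etc/default/elasticsearch
    ES_HEAP_SIZE=4g              # sets -Xms and -Xmx to the same value

    # config/elasticsearch.yml
    bootstrap.mlockall: true

    # /etc/security/limits.conf - let the elasticsearch user lock the heap in RAM
    elasticsearch - memlock unlimited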

Also, you didn't mention your Java version and release, which would be
helpful.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com


On 22 July 2014 02:38, kmoore.cce <kmoore@gmail.com> wrote:

 I have had some issues recently as I've expanded my ES cluster, where a
 single node failure causes basically all other index/search operations to
 time out and fail.

 I am currently running elasticsearch v1.2.1 and primarily interface with
 the indices using the elasticsearch python module.

 My cluster is 20 nodes, each an m1.large EC2 instance. I currently have
 ~18 indices, each with 5 shards and 3 replicas. The average index is ~20 GB
 and ~10 million documents (the smallest is ~100K documents / ~300 MB, the
 largest ~40 million documents / ~35 GB).
 I run each node with ES_MAX_SIZE=4g and ES_MIN_SIZE=512m. There are no
 other services running on the elasticsearch nodes, except ssh. I use zen
 unicast discovery with a set list of nodes. I have tried to enable
 'bootstrap.mlockall', but the ulimit settings do not seem to be working and
 I keep getting 'Unable to lock JVM memory (ENOMEM)' errors when starting a
 node (note: I didn't see this log message when running 0.90.7).
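 (Back of the envelope: that is 18 x 5 x 4 copies (1 primary + 3 replicas)
 = 360 shard copies, or roughly 18 shards and ~70 GB of data per node, each
 node with a 4 GB heap.)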

 I have a fairly constant stream of new or updated documents being ingested
 (I don't actually update, but rather reindex whenever a new document with the
 same id is found), and a number of users querying the data on a regular
 basis - most queries are set queries through the python API.
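 For illustration, via the elasticsearch-py client (index, type, and id below
 are just placeholders):

     from elasticsearch import Elasticsearch

     es = Elasticsearch(["localhost:9200"])

     # Indexing with an explicit id overwrites any existing document with
     # that id, which is what I mean by "reindex" above.
     es.index(index="myindex", doc_type="mytype", id="doc-123",
              body={"field": "value"})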

 The issue I have now is that while data is being ingested/indexed, I will
 hit Java heap out-of-memory errors. I think this is related to garbage
 collection, as that seems to be the last activity in the logs nearly
 every time this occurs. I have tried raising the heap max to 6g, and that
 seems to help, but I am not sure it solves the issue. On top of that, when
 the out-of-memory error occurs it seems to cause the other nodes to stop
 working effectively, with timeout errors in both indexing and searching.

 My question is: what is the best way to tolerate a node failing for this
 reason? I would obviously like to solve the underlying problem as well, but
 I would also like to be able to survive a node crashing for some reason
 (whether it be because of me or because EC2 took it away). Shouldn't
 failover to the replicas cover for the missing node? I understand the
 cluster state would be yellow at that point, but I should still be able to
 index and search data on the remaining nodes, correct?

 Are there configuration changes I can make to better support the cluster
 and identify or solve the underlying issue?

 Any help is appreciated. I understand I have a lot to learn about
 Elasticsearch, but I am hoping I can add some stability/resiliency to my
 cluster.

 Thanks in advance,
 -Kevin





Re: Handling node failure in ES cluster

2014-07-21 Thread Otis Gospodnetic
Lots of things could be the source of problems here.  Maybe you can tune 
the JVM params.  We don't know what you are using or what your GC activity 
looks like.  Can you share GC metrics graphs?  If you don't have any GC 
monitoring, you can use SPM (http://sematext.com/spm/).  Why do you have 5 
shards for all indices?  Some seem small and shouldn't need to be sharded 
so much.  Why do you have 3 replicas and not, say, just 2? (we don't know 
your query rates).
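
If it helps, replica count is a dynamic setting, so you could drop it on the 
smaller indices without reindexing - for example via the python client (the 
index name below is just a placeholder):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])  # any node in the cluster

    # Reduce replicas on a small index; takes effect immediately, no reindex.
    es.indices.put_settings(index="small-index",
                            body={"index": {"number_of_replicas": 2}})

For GC visibility you can also start the JVM with flags along the lines of 
-verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/elasticsearch/gc.log and 
graph the output.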

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

