Hello, I have a question about cluster recovery after the cluster goes into an unhealthy state. Let's assume the following.
We have a cluster with 9 nodes:

3 master nodes (esmX) (master=true, data=false)
4 data nodes (esdX) (master=false, data=true)
2 client nodes (escX) (master=false, data=false)

minimum_master_nodes is set to 2.

The cluster is deployed across two racks:

rack 1: esm1, esm2, esd1, esd2 and esc1
rack 2: esm3, esd3, esd4 and esc2

With this configuration I can lose rack 2 and the remaining nodes still meet the requirements to form a proper cluster. If I were to lose rack 1 permanently or for a long time, I would manually spin up a second master node in rack 2 to satisfy the minimum of 2 masters.

If the network connection between the two racks fails, the cluster goes into an unhealthy state. After a while rack 1 is back online and everything works again. I noticed that this can take many minutes. Even after playing with the timeout settings for failure detection, it takes a relatively long time until the cluster decides that the other nodes are gone, and again before it is back to normal.

My question: is that normal? Do I have to live with a few minutes of downtime if part of the cluster becomes unreachable? Or are there any options I could still try to tune?

Thanks,
Marco
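
For reference, the quorum reasoning behind the rack layout can be written down as a small sketch. This is only an illustration in Python, not anything Elasticsearch runs: the names RACKS, master_eligible and can_elect_master are made up for this post, and the sketch assumes that master-eligible nodes are exactly those whose name starts with "esm".

```python
# Hypothetical model of the rack layout described in this post.
# None of these names are part of Elasticsearch; they only illustrate the arithmetic.
MINIMUM_MASTER_NODES = 2

RACKS = {
    "rack1": ["esm1", "esm2", "esd1", "esd2", "esc1"],
    "rack2": ["esm3", "esd3", "esd4", "esc2"],
}


def master_eligible(nodes):
    """Return the master-eligible nodes (assumed here to be the esmX nodes)."""
    return [name for name in nodes if name.startswith("esm")]


def can_elect_master(reachable_nodes, minimum=MINIMUM_MASTER_NODES):
    """True if enough master-eligible nodes are reachable to elect a master."""
    return len(master_eligible(reachable_nodes)) >= minimum


if __name__ == "__main__":
    # Check each side of a rack-to-rack network split on its own.
    for rack, nodes in RACKS.items():
        print(rack, "alone can elect a master:", can_elect_master(nodes))
```

With the layout above, only the rack 1 side of a split holds two master-eligible nodes, so the rack 2 side cannot elect a master until a second master node (for example a manually started one) joins it.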
