Re: Indices not recovering after elasticsearch upgrade (1.0.2 - 1.4.1)
Anyone?

On Mon, Dec 1, 2014 at 11:22 AM, Michel Conrad michel.con...@trendiction.com wrote:
> [quoted text hidden]
Indices not recovering after elasticsearch upgrade (1.0.2 - 1.4.1)
Hi,

I just updated our test environment from 1.0.2 to 1.4.1 and some indices failed to recover, which seems to be related to the checksum verification introduced in 1.3.

    [2014-11-28 09:40:48,019][WARN ][cluster.action.shard ] [NODE1] [index][0] received shard failed for [index][0], node[CWq_uCPhRKqGEAvtS1jkug], [P], s[INITIALIZING], indexUUID [yJBShgqGQgi0q5NbMms0Sg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[index][0] failed to fetch index version after copying it over]; nested: CorruptIndexException[[index][0] Preexisting corrupted index [corrupted_JysmZSaLRXWN_BgqpRSo6Q] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=16ncx91 actual=1xc6e7g resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1afc89e8)]
    org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=16ncx91 actual=1xc6e7g resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1afc89e8)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

In order to get the indices to recover I checked them using org.apache.lucene.index.CheckIndex; the indices seemed fine, as no errors were reported. Reopening the indices did not solve the issue. After deleting the checksums file as well as the corrupted_XXX marker file, the indices finally recovered correctly. I suppose the verification step is then simply skipped, as there are no checksums left to compare against.

I am currently trying to understand the issue. Might it be that the checksums file itself was corrupted? Also, while I did not see any direct consequences of deleting the checksums file, I just want to be sure that deleting it does not cause any issues.

Any thoughts or help is greatly appreciated,
Michel
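For the record, this is roughly what I ran. The jar version and data path are from our setup (1.4.1 ships with lucene-core-4.10.2; the cluster name and index layout below are placeholders), so adjust them to your installation, and only delete the marker files after CheckIndex comes back clean:

    # read-only Lucene-level check of one shard's index directory
    java -cp /usr/share/elasticsearch/lib/lucene-core-4.10.2.jar \
      org.apache.lucene.index.CheckIndex \
      /var/lib/elasticsearch/mycluster/nodes/0/indices/index/0/index

    # what finally let the shard recover: removing the legacy checksums
    # file and the corruption marker from the shard's index directory
    cd /var/lib/elasticsearch/mycluster/nodes/0/indices/index/0/index
    rm _checksums-* corrupted_*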
Re: Disk-based shard allocation per node settings
What if I set the option in the configuration file — will it then only be applied to the node reading the configuration file?

On Fri, May 16, 2014 at 10:36 AM, Mark Walkom ma...@campaignmonitor.com wrote:
> It's a cluster level setting, so all nodes use the same one. As far as I know, there is no node level setting.
>
> Regards,
> Mark Walkom
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
> On 16 May 2014 17:43, Michel Conrad michel.con...@trendiction.com wrote:
>> Is it possible to specify different thresholds for different nodes in the cluster? I could not find it in the documentation. When updating the setting through the cluster settings API as described on http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html#disk the threshold is applied to every node. When specifying the thresholds through the configuration file, will it then only be applied to the node reading the configuration? In that case, how do I prevent the cluster settings being applied and overwriting the configuration file (i.e. remove a setting)?
>> Thanks for any thoughts,
>> Michel
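To make the two options concrete, here is a sketch of both ways of setting the thresholds on the 1.x line (the watermark values are placeholders):

    # cluster-wide via the settings API, applied to every node
    # (transient settings are lost on a full cluster restart)
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.disk.watermark.low":  "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%"
      }
    }'

    # elasticsearch.yml on a single node -- but per Mark's answer the
    # allocation decider uses the cluster-level value, so this is not
    # a per-node override
    cluster.routing.allocation.disk.watermark.low: 85%
    cluster.routing.allocation.disk.watermark.high: 90%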
Re: Slow cluster startup with zen discovery and large number of nodes.
Also, the kernel complains about too many connections being made at once on the joining node (this seems to occur once about 30 nodes have joined the cluster):

    TCP: Possible SYN flooding on port 9300. Sending cookies. Check SNMP counters.

On Fri, Feb 21, 2014 at 6:44 PM, Thibaut Britz t.br...@trendiction.com wrote:
> Hi,
>
> I'm working with Michel on this issue. The cluster is completely empty and has no indexes at all, so it certainly is not related to recovery. The old elasticsearch version doesn't have the code that waits for replies, which is what causes the very slow startup here.
>
> Thanks,
> Thibaut
>
> On Fri, Feb 21, 2014 at 6:11 PM, Binh Ly b...@hibalo.com wrote:
>> You may be interested in some settings that help a full cluster restart: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html#recover-after
>> There is also a webinar that talks about some of the above: http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/
Slow cluster startup with zen discovery and large number of nodes.
Hi,

Starting a cluster with 100 nodes takes half an hour just for the nodes to join in elasticsearch version 1.0. In version 0.19.8, nodes were very quick to join the cluster.

The issue seems to come from the master node sending the updated state to all the nodes in the cluster after every single addition of a node, and then waiting for the nodes to acknowledge the cluster update before adding the next node (zen-disco-receive). Setting discovery.zen.publish_timeout: 0 seems to resolve the issue during startup, because the master node no longer blocks, but I am not sure whether something can go wrong afterwards while running the cluster with the timeout set to 0.

I also tried increasing the kernel connection limits, but it did not make a difference:

    sysctl -w net.ipv4.tcp_max_syn_backlog=20480
    sysctl -w net.core.somaxconn=8192
    sysctl -w net.ipv4.tcp_syncookies=1
    sysctl -w net.ipv4.tcp_synack_retries=1

So the question is whether it is safe to run the cluster with discovery.zen.publish_timeout set to 0, and whether it is expected behavior that zen discovery does not perform well for a larger number of nodes — or whether there might still be something wrong with the setup.

Thanks in advance,
Michel
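For completeness, the discovery section of the elasticsearch.yml we are experimenting with looks roughly like this. The ping settings are placeholders standing in for our actual host list; the line under discussion is publish_timeout:

    # elasticsearch.yml -- discovery settings used for the startup experiment
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["master1:9300", "master2:9300", "master3:9300"]
    discovery.zen.minimum_master_nodes: 2
    # don't block cluster-state publishing on per-node acks (default 30s);
    # this is the workaround whose safety is being asked about, not a recommendation
    discovery.zen.publish_timeout: 0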
Temporary disable river allocation (dynamic cluster setting?)
Is it possible to temporarily disable river allocation?

We want to separate the nodes running the rivers from the nodes holding the data, so that the river nodes can be temporarily shut down or restarted without having an impact on the data nodes. After restarting the river nodes, all our rivers get allocated to the first node that joins the cluster (as it is the only node with river set to true). I suppose that a ClusterChangedEvent gets fired for one node after the other, and that the RiversRouter assigns all the rivers after the first node joins.

I could not find an option to temporarily disable river allocation; with one, I could wait until all the nodes have been restarted and only then re-enable river allocation. Alternatively, an option to delay river allocation would also help.

A working (but not very elegant) solution is to restart the river nodes, then restart the single node holding all the rivers, which causes the RiversRouter to redistribute the rivers uniformly among the other nodes. (This seems somewhat hacky to me ;-) and has the disadvantage that the double-restarted node ends up running no river.)

Thanks in advance for any advice or comments,
Michel
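For context, the role split in our configuration looks roughly like this. If I read the rivers documentation correctly, node.river: _none_ keeps a node from running any river — treat that as my assumption:

    # elasticsearch.yml on the river nodes: run rivers, hold no data
    node.master: false
    node.data: false

    # elasticsearch.yml on the data nodes: hold data, never run a river
    node.data: true
    node.river: _none_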