Re: Indices not recovering after elasticsearch upgrade (1.0.2 - 1.4.1)

2014-12-03 Thread Michel Conrad
Anyone?

On Mon, Dec 1, 2014 at 11:22 AM, Michel Conrad
michel.con...@trendiction.com wrote:
 Hi,

 I just updated our test environment from 1.0.2 to 1.4.1 and some
 indices failed to recover, which seems to be related to the checksum
 verification introduced in 1.3.

  [2014-11-28 09:40:48,019][WARN ][cluster.action.shard ] [NODE1]
 [index][0] received shard failed for [index][0],
 node[CWq_uCPhRKqGEAvtS1jkug], [P], s[INITIALIZING], indexUUID
 [yJBShgqGQgi0q5NbMms0Sg], reason [Failed to start shard, message
 [IndexShardGatewayRecoveryException[[index][0] failed to fetch index
 version after copying it over]; nested:
 CorruptIndexException[[index][0] Preexisting corrupted index
 [corrupted_JysmZSaLRXWN_BgqpRSo6Q] caused by:
 CorruptIndexException[checksum failed (hardware problem?) :
 expected=16ncx91 actual=1xc6e7g
 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1afc89e8)]
 org.apache.lucene.index.CorruptIndexException: checksum failed
 (hardware problem?) : expected=16ncx91 actual=1xc6e7g
 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1afc89e8)
 at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
 at org.elasticsearch.index.store.Store.verify(Store.java:365)
 at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
 at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
 at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 In order to get the indices to recover, I checked them using
 org.apache.lucene.index.CheckIndex; the indices seemed OK, as no
 errors were reported. Reopening the indices did not solve the issue.

 After deleting the checksums file as well as the corrupted_XXX marker
 file, the indices finally recovered correctly. I suppose that the
 verification step is then simply skipped, as there are no checksums to
 compare against.

 I am currently trying to understand the issue. Might it be that the
 checksums file itself was corrupted? Also, while I did not see any
 direct consequences of deleting the checksums file, I just want to be
 sure that deleting it does not cause any issues.

 Any thoughts or help is greatly appreciated,
 Michel



Indices not recovering after elasticsearch upgrade (1.0.2 - 1.4.1)

2014-12-01 Thread Michel Conrad
Hi,

I just updated our test environment from 1.0.2 to 1.4.1 and some
indices failed to recover, which seems to be related to the checksum
verification introduced in 1.3.

 [2014-11-28 09:40:48,019][WARN ][cluster.action.shard ] [NODE1]
[index][0] received shard failed for [index][0],
node[CWq_uCPhRKqGEAvtS1jkug], [P], s[INITIALIZING], indexUUID
[yJBShgqGQgi0q5NbMms0Sg], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[index][0] failed to fetch index
version after copying it over]; nested:
CorruptIndexException[[index][0] Preexisting corrupted index
[corrupted_JysmZSaLRXWN_BgqpRSo6Q] caused by:
CorruptIndexException[checksum failed (hardware problem?) :
expected=16ncx91 actual=1xc6e7g
resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1afc89e8)]
org.apache.lucene.index.CorruptIndexException: checksum failed
(hardware problem?) : expected=16ncx91 actual=1xc6e7g
resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1afc89e8)
at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
at org.elasticsearch.index.store.Store.verify(Store.java:365)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

In order to get the indices to recover, I checked them using
org.apache.lucene.index.CheckIndex; the indices seemed OK, as no
errors were reported. Reopening the indices did not solve the issue.
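
(For reference, this is roughly how I invoked CheckIndex - the jar
location, lucene version and shard path are only examples from our
setup and will differ per installation:

java -cp /usr/share/elasticsearch/lib/lucene-core-4.10.2.jar \
  org.apache.lucene.index.CheckIndex \
  /var/lib/elasticsearch/mycluster/nodes/0/indices/index/0/index

Without the -fix flag it only reports problems and does not modify the
index.)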

After deleting the checksums file as well as the corrupted_XXX marker
file, the indices finally recovered correctly. I suppose that the
verification step is then simply skipped, as there are no checksums to
compare against.
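
(To be precise about which files I mean, they are the ones matching
roughly these patterns in the shard's index directory - the data path
here is again just an example:

ls /var/lib/elasticsearch/mycluster/nodes/0/indices/index/0/index/_checksums-*
ls /var/lib/elasticsearch/mycluster/nodes/0/indices/index/0/index/corrupted_*
)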

I am currently trying to understand the issue. Might it be that the
checksums file itself was corrupted? Also, while I did not see any
direct consequences of deleting the checksums file, I just want to be
sure that deleting it does not cause any issues.

Any thoughts or help is greatly appreciated,
Michel



Re: Disk-based shard allocation per node settings

2014-05-16 Thread Michel Conrad
What if I set the option in the configuration file? Will it then only
be applied to the node reading the configuration file?
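
For context, these are the two places I am talking about (setting names
as in the 1.x disk-based allocation docs; the values are only examples):

# per-node configuration file (elasticsearch.yml)
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%

# cluster-wide, via the settings API
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'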

On Fri, May 16, 2014 at 10:36 AM, Mark Walkom ma...@campaignmonitor.com wrote:
 It's a cluster level setting, so all nodes use the same one.
 As far as I know, there is no node level setting.

 Regards,
 Mark Walkom

 Infrastructure Engineer
 Campaign Monitor
 email: ma...@campaignmonitor.com
 web: www.campaignmonitor.com


 On 16 May 2014 17:43, Michel Conrad michel.con...@trendiction.com wrote:

 Is it possible to specify different thresholds for different nodes in
 the cluster? I could not find it in the documentation. When updating
 the setting through the cluster settings API as described at

 http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html#disk
 the threshold is applied to every node.

 When specifying the thresholds through the configuration file, will
 they then only be applied to the node reading the configuration? In
 that case, how do I prevent the cluster settings from being applied
 and overwriting the configuration file (i.e. how do I remove a
 setting)?

 Thanks for any thoughts,
 Michel



Re: Slow cluster startup with zen discovery and large number of nodes.

2014-02-24 Thread Michel Conrad
Also, the kernel complains about too many connections being made at
once on the joining node (this seems to occur after 30 nodes have
joined the cluster):

TCP: TCP: Possible SYN flooding on port 9300. Sending cookies.  Check
SNMP counters.

On Fri, Feb 21, 2014 at 6:44 PM, Thibaut Britz t.br...@trendiction.com wrote:
 Hi,

 I'm working with Michel on that issue:

 The cluster is completely empty and has no indices at all, so it certainly
 is not related to recovery.
 The old elasticsearch version doesn't have the code that waits for replies,
 which is what causes the very slow startup.

 Thanks,
 Thibaut





 On Fri, Feb 21, 2014 at 6:11 PM, Binh Ly b...@hibalo.com wrote:

 You may be interested in some settings that help a full cluster restart:


 http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html#recover-after
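
 Roughly, that page describes settings like these in elasticsearch.yml
 (the numbers are only examples, tune them to your cluster size):

 gateway.recover_after_nodes: 90
 gateway.expected_nodes: 100
 gateway.recover_after_time: 5m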

 There is also a webinar that talks about some of the above:

 http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/

 --
 You received this message because you are subscribed to the Google Groups
 elasticsearch group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to elasticsearch+unsubscr...@googlegroups.com.
 To view this discussion on the web visit
 https://groups.google.com/d/msgid/elasticsearch/a405d32a-50a5-4d83-a3c7-8a0ea3449d28%40googlegroups.com.

 For more options, visit https://groups.google.com/groups/opt_out.



-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAH0sEYjyTwFZFP7ZPCsOquSJDrYmdTrrhkmpD5XxbKpv-YO85A%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


Slow cluster startup with zen discovery and large number of nodes.

2014-02-21 Thread Michel Conrad
Hi,

Starting a cluster with 100 nodes takes half an hour just for the
nodes to join in elasticsearch version 1.0. In version 0.19.8 nodes
were very quick to join the cluster. The issue seems to come from the
master node sending the updated state to all the nodes in the cluster
after every single addition of a node and then waiting for the nodes
to acknowledge the cluster update before adding the next node
(zen-disco-receive).

Setting discovery.zen.publish_timeout:0 seems to resolve the issue
during startup, because the master node does not block anymore, but I
am not sure if something can go wrong afterwards while running the
cluster with the timeout set to 0.
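
(For completeness, this is the exact line I mean, e.g. in
elasticsearch.yml - if I read the docs correctly the default is 30s:

discovery.zen.publish_timeout: 0
)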

I also tried increasing the kernel connection limits, but it did not
make a difference:
sysctl -w net.ipv4.tcp_max_syn_backlog=20480
sysctl -w net.core.somaxconn=8192
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_synack_retries=1

So the questions are: is it safe to run the cluster with
discovery.zen.publish_timeout set to 0, and is it expected behavior
that zen discovery does not perform well for a larger number of nodes,
or might there still be something wrong with the setup?

Thanks in Advance,
Michel



Temporarily disable river allocation (dynamic cluster setting?)

2014-01-14 Thread Michel Conrad
Is it possible to temporarily disable river allocation?

We want to separate the nodes running the rivers from the nodes
holding the data, so that the nodes running the rivers can be
temporarily shut down or restarted without having an impact on the
data nodes. After restarting the river nodes, all our rivers get
allocated to the first node that joins the cluster (as it is the only
node with river set to true).
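
(Our role split is configured roughly like this - only a sketch of how
I understand the node.river setting, the values are examples:

# data-only nodes: never run rivers
node.data: true
node.river: _none_

# river nodes: node.river left at its default, so rivers may be
# allocated here
node.data: false
)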

I suppose that a ClusterChangedEvent gets fired for one node after the
other, and that the RiversRouter assigns all the rivers after the
first node joins. I could not find an option to temporarily disable
river allocation; with such an option I could wait until all the nodes
have been restarted and only then re-enable river allocation. Or maybe
an option to delay river allocation could be added.

A working (but not very elegant) solution is to restart the river
nodes, then restart the single node holding all the rivers, which
causes the RiversRouter to redistribute the rivers uniformly among the
other nodes. (This seems somewhat hacky to me ;-) and has the
disadvantage that the double-restarted node has no river running.)

Thanks in advance for any advice or comments,

Michel
