We just ran into some mysterious problems when upgrading from Elasticsearch 
1.1.1 to 1.5.0.

The cluster consists of three machines: two data nodes and one master-only 
node. It hosts 86 indices, each with one replica.

I stopped writes, took a snapshot and stopped the entire cluster before I 
upgraded the nodes and restarted them. The system came up and quickly 
turned yellow, but it refused to become green: it failed to recover a 
number of shards. The errors in the logs looked like this (there were a lot 
of them):
[2015-03-31 07:33:39,704][WARN ][indices.cluster          ] [NODE1] [signal_bin][0] sending failed shard after recovery failure
org.elasticsearch.indices.recovery.RecoveryFailedException: [signal_bin][0]: Recovery failed from [NODE2][rpXLVgS8Qw2jgimXNYKn_A][NODE2][inet[/IP2:9300]]{aws_availability_zone=us-east-1d, max_local_storage_nodes=1} into [NODE1][tdXdf0MeS62DIO0KFZX-Rg][NODE1][inet[/IP1:9300]]{aws_availability_zone=us-east-1b, max_local_storage_nodes=1}
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
    at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE2][inet[/IP2:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [signal_bin][0] Phase[1] Execution failed
    at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:839)
    at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:684)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
    at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [signal_bin][0] Failed to transfer [11] files with total size of [1.4mb]
    at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:413)
    at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:834)
    ... 10 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE1][inet[/IP1:9300]][internal:index/shard/recovery/clean_files]
Caused by: org.elasticsearch.indices.recovery.RecoveryFailedException: [signal_bin][0]: Recovery failed from [NODE2][rpXLVgS8Qw2jgimXNYKn_A][NODE2][inet[/IP2:9300]]{aws_availability_zone=us-east-1d, max_local_storage_nodes=1} into [NODE1][tdXdf0MeS62DIO0KFZX-Rg][NODE1][inet[/IP1:9300]]{aws_availability_zone=us-east-1b, max_local_storage_nodes=1} (failed to clean after recovery)
    at org.elasticsearch.indices.recovery.RecoveryTarget$CleanFilesRequestHandler.messageReceived(RecoveryTarget.java:443)
    at org.elasticsearch.indices.recovery.RecoveryTarget$CleanFilesRequestHandler.messageReceived(RecoveryTarget.java:389)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.ElasticsearchIllegalStateException: local version: name [_yor.si], length [363], checksum [1jnqbzx], writtenBy [null] is different from remote version after recovery: name [_yor.si], length [363], checksum [null], writtenBy [null]
    at org.elasticsearch.index.store.Store.verifyAfterCleanup(Store.java:645)
    at org.elasticsearch.index.store.Store.cleanupAndVerify(Store.java:613)
    at org.elasticsearch.indices.recovery.RecoveryTarget$CleanFilesRequestHandler.messageReceived(RecoveryTarget.java:428)
    ... 6 more
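
For reference, the pre-upgrade and post-restart steps amounted to roughly the 
following. This is only a minimal sketch against the 1.x REST API using 
python-requests; the host, the repository name ("backups") and the snapshot 
name are placeholders rather than our actual setup, and the snapshot 
repository has to be registered beforehand.

import requests

ES = "http://localhost:9200"

# Snapshot all indices and wait for completion before stopping the cluster.
# Assumes a snapshot repository named "backups" is already registered.
r = requests.put(ES + "/_snapshot/backups/pre_1_5_0_upgrade",
                 params={"wait_for_completion": "true"})
r.raise_for_status()

# After upgrading and restarting the nodes, check cluster health; we expected
# "green", but the cluster stayed "yellow" with unassigned replica shards.
health = requests.get(ES + "/_cluster/health").json()
print(health["status"], health.get("unassigned_shards"))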

The index/shard mentioned varied. We finally got past this by setting the 
affected indices to 0 replicas and then back to 1.
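
In case it is useful to others, that replica toggle boils down to roughly the 
following; again a minimal sketch via python-requests against the index 
settings API, with the host and index name as placeholders for each index 
that failed to recover.

import requests

ES = "http://localhost:9200"
INDEX = "signal_bin"  # repeat for each affected index

# Drop the replicas; the broken replica shards are discarded.
requests.put(ES + "/" + INDEX + "/_settings",
             json={"index": {"number_of_replicas": 0}}).raise_for_status()

# Add them back; Elasticsearch rebuilds each replica by copying the primary.
# (One can poll /_cluster/health in between to let the change settle.)
requests.put(ES + "/" + INDEX + "/_settings",
             json={"index": {"number_of_replicas": 1}}).raise_for_status()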

Has anybody seen something similar? Did we hit a bug or did we do something 
wrong?

    /MaF
