We just encountered some mysterious problems when upgrading Elasticsearch from 1.1.1 to 1.5.0.
The cluster consists of three machines: two data nodes and one master-only node. It hosts 86 indices, each with one replica. I stopped writes, took a snapshot, and stopped the entire cluster before upgrading the nodes and restarting them (roughly the commands in the P.S. below). The system came up and quickly turned yellow, but it refused to become green: it failed to recover a number of shards. The errors in the logs looked like this (there were a lot of them):

[2015-03-31 07:33:39,704][WARN ][indices.cluster] [NODE1] [signal_bin][0] sending failed shard after recovery failure
org.elasticsearch.indices.recovery.RecoveryFailedException: [signal_bin][0]: Recovery failed from [NODE2][rpXLVgS8Qw2jgimXNYKn_A][NODE2][inet[/IP2:9300]]{aws_availability_zone=us-east-1d, max_local_storage_nodes=1} into [NODE1][tdXdf0MeS62DIO0KFZX-Rg][NODE1][inet[/IP1:9300]]{aws_availability_zone=us-east-1b, max_local_storage_nodes=1}
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
    at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE2][inet[/IP2:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [signal_bin][0] Phase[1] Execution failed
    at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:839)
    at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:684)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
    at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [signal_bin][0] Failed to transfer [11] files with total size of [1.4mb]
    at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:413)
    at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:834)
    ... 10 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE1][inet[/IP1:9300]][internal:index/shard/recovery/clean_files]
Caused by: org.elasticsearch.indices.recovery.RecoveryFailedException: [signal_bin][0]: Recovery failed from [NODE2][rpXLVgS8Qw2jgimXNYKn_A][NODE2][inet[/IP2:9300]]{aws_availability_zone=us-east-1d, max_local_storage_nodes=1} into [NODE1][tdXdf0MeS62DIO0KFZX-Rg][NODE1][inet[/IP1:9300]]{aws_availability_zone=us-east-1b, max_local_storage_nodes=1} (failed to clean after recovery)
    at org.elasticsearch.indices.recovery.RecoveryTarget$CleanFilesRequestHandler.messageReceived(RecoveryTarget.java:443)
    at org.elasticsearch.indices.recovery.RecoveryTarget$CleanFilesRequestHandler.messageReceived(RecoveryTarget.java:389)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.ElasticsearchIllegalStateException: local version: name [_yor.si], length [363], checksum [1jnqbzx], writtenBy [null] is different from remote version after recovery: name [_yor.si], length [363], checksum [null], writtenBy [null]
    at org.elasticsearch.index.store.Store.verifyAfterCleanup(Store.java:645)
    at org.elasticsearch.index.store.Store.cleanupAndVerify(Store.java:613)
    at org.elasticsearch.indices.recovery.RecoveryTarget$CleanFilesRequestHandler.messageReceived(RecoveryTarget.java:428)
    ... 6 more

The index/shard mentioned varied. We finally got past this by configuring the troubling indices to have 0 replicas and then back to 1 (exact calls in the P.S. as well).

Has anybody seen something similar? Did we hit a bug, or did we do something wrong?

/MaF
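P.S. For reference, the pre-upgrade steps were roughly the following (writes were stopped on the application side first; the repository name "upgrade_backup" and snapshot name "pre_1_5_0" are just examples, and the repository had been registered earlier):

# Flush everything to disk, then snapshot all indices and wait for completion
curl -XPOST 'http://localhost:9200/_flush'
curl -XPUT 'http://localhost:9200/_snapshot/upgrade_backup/pre_1_5_0?wait_for_completion=true'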
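And this is the replica workaround, per affected index (using signal_bin from the log above as the example; we repeated it for each troubling index):

# Drop the replica count to 0, which discards the unrecoverable replica copies
curl -XPUT 'http://localhost:9200/signal_bin/_settings' -d '{
  "index": { "number_of_replicas": 0 }
}'

# Once the cluster has settled, bring the replica back;
# it is then rebuilt from scratch from the primary
curl -XPUT 'http://localhost:9200/signal_bin/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'

As far as I understand, this forces Elasticsearch to build a fresh replica from the primary instead of reusing the existing files on disk, which is presumably why it sidesteps the checksum mismatch.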