damaged ES cluster after upgrade - serious problem - please help

Grzegorz K Wed, 17 Dec 2014 06:24:08 -0800

Hello,

I have upgraded Elasticsearch from version 0.90.3 to version 1.3.4 (OS: Debian 
Wheezy, deb package).
This is a cluster of 3 nodes using unicast discovery.
The upgrade was done with Elasticsearch shut down.
After starting the new version, the cluster health is 'yellow' according to 
the head plugin
and 'red' according to curl on _cluster/health.


3 indices in the cluster have 3 unassigned shards.

The logs on all nodes are full of "Corrupted index" and 
"sending failed shard for" messages.

Would upgrading to version 1.4.2 fix the problem (because of the Lucene fix 
LUCENE-5975)?
Deleting the indices and reloading their data is a last resort.

ES state from first node:

curl -XGET 'http://127.0.0.1:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "searchcass",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 283,
  "active_shards" : 576,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 3
}
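
The health output above is cluster-wide, so it does not say which indices hold the red shards. A minimal sketch, assuming Python 3 is available on a node and that the health API is called with level=indices so that it returns a per-index breakdown. The function names and the sample response below are hypothetical and are not taken from this cluster:

```python
import json
import urllib.request


def fetch_health(base_url="http://127.0.0.1:9200"):
    """Fetch cluster health with a per-index breakdown (level=indices)."""
    url = base_url + "/_cluster/health?level=indices"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def problem_indices(health):
    """Return (name, status, unassigned, initializing) for each non-green index."""
    rows = []
    for name, info in sorted(health.get("indices", {}).items()):
        if info.get("status") != "green":
            rows.append((
                name,
                info.get("status"),
                info.get("unassigned_shards", 0),
                info.get("initializing_shards", 0),
            ))
    return rows


# Hypothetical response fragment for illustration only (made-up values).
sample = {
    "status": "red",
    "indices": {
        "index_a": {"status": "green", "unassigned_shards": 0, "initializing_shards": 0},
        "index_b": {"status": "red", "unassigned_shards": 1, "initializing_shards": 1},
    },
}

for row in problem_indices(sample):
    print(row)
```

Against a live node, problem_indices(fetch_health()) would give the same kind of list for the real cluster.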

How can I fix it? Please reply. 

Regards

Grzesiek

ES log from node 1 (search01):
...
[2014-12-17 11:04:20,176][WARN ][cluster.action.shard     ] [search01] 
[201205][0] received shard failed for [201205][0], 
node[OWUJ3lZbT5i00JKgrDFUcw], [P], s[INITIALIZING], indexUUID [_na_], 
reason [master 
[search01][HYtX23nPS7uU-DeY-zF6AA][search01][inet[/192.168.199.211:9300]] 
marked shard as initializing, but shard is marked as failed, resend shard 
failure]
[2014-12-17 11:04:20,253][WARN ][indices.cluster          ] [search01] 
[201301][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: 
[201301][0] failed to fetch index version after copying it over
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
    at 
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.lucene.index.CorruptIndexException: [201301][0] 
Corrupted index [corrupted_cFQBoZ-WTK2sW8mgUUv1vw] caused by: 
CorruptIndexException[did not read all bytes from file: read 9650 vs size 
9651 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201301/0/index/_5f9v_k.del")))]
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:353)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:338)
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)
    ... 4 more
[2014-12-17 11:04:20,279][WARN ][cluster.action.shard     ] [search01] 
[201304][4] received shard failed for [201304][4], 
node[zygoKW7SR6CwvanVoNrPcw], [P], s[INITIALIZING], indexUUID [_na_], 
reason [Failed to start shard, message 
[IndexShardGatewayRecoveryException[[201304][4] failed to fetch index 
version after copying it over]; nested: CorruptIndexException[[201304][4] 
Corrupted index [corrupted_7hrGiX_jTx2KLbQUIAiLpg] caused by: 
CorruptIndexException[did not read all bytes from file: read 295641 vs size 
295642 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201304/4/index/_294h_17.del")))]];
 
]]
[2014-12-17 11:04:20,305][WARN ][cluster.action.shard     ] [search01] 
[201304][4] received shard failed for [201304][4], 
node[zygoKW7SR6CwvanVoNrPcw], [P], s[INITIALIZING], indexUUID [_na_], 
reason [master 
[search01][HYtX23nPS7uU-DeY-zF6AA][search01][inet[/192.168.199.211:9300]] 
marked shard as initializing, but shard is marked as failed, resend shard 
failure]
[2014-12-17 11:04:20,329][WARN ][cluster.action.shard     ] [search01] 
[201301][0] sending failed shard for [201301][0], 
node[HYtX23nPS7uU-DeY-zF6AA], [P], s[INITIALIZING], indexUUID [_na_], 
reason [Failed to start shard, message 
[IndexShardGatewayRecoveryException[[201301][0] failed to fetch index 
version after copying it over]; nested: CorruptIndexException[[201301][0] 
Corrupted index [corrupted_cFQBoZ-WTK2sW8mgUUv1vw] caused by: 
CorruptIndexException[did not read all bytes from file: read 9650 vs size 
9651 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcassandra/nodes/0/indices/201301/0/index/_5f9v_k.del")))]];
 
]]
[2014-12-17 11:04:20,329][WARN ][cluster.action.shard     ] [search01] 
[201301][0] received shard failed for [201301][0], 
node[HYtX23nPS7uU-DeY-zF6AA], [P], s[INITIALIZING], indexUUID [_na_], 
reason [Failed to start shard, message 
[IndexShardGatewayRecoveryException[[201301][0] failed to fetch index 
version after copying it over]; nested: CorruptIndexException[[201301][0] 
Corrupted index [corrupted_cFQBoZ-WTK2sW8mgUUv1vw] caused by: 
CorruptIndexException[did not read all bytes from file: read 9650 vs size 
9651 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201301/0/index/_5f9v_k.del")))]];
 
]]
[2014-12-17 11:04:20,331][WARN ][cluster.action.shard     ] [search01] 
[201301][0] received shard failed for [201301][0], 
node[HYtX23nPS7uU-DeY-zF6AA], [P], s[INITIALIZING], indexUUID [_na_], 
reason [master 
[search01][HYtX23nPS7uU-DeY-zF6AA][search01][inet[/192.168.199.211:9300]] 
marked shard as initializing, but shard is marked as failed, resend shard 
failure]
...

ES log from node 2 (search02):

[2014-12-17 11:10:11,971][WARN ][cluster.action.shard     ] [search02] 
[201301][0] sending failed shard for [201301][0], 
node[OWUJ3lZbT5i00JKgrDFUcw], [P], s[INITIALIZING], indexUUID [_na_], 
reason [Failed to start shard, message 
[IndexShardGatewayRecoveryException[[201301][0] failed to fetch index 
version after copying it over]; nested: CorruptIndexException[[201301][0] 
Corrupted index [corrupted_U1eBtw3YRYKcfuV9ZHPadw] caused by: 
CorruptIndexException[did not read all bytes from file: read 9650 vs size 
9651 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201301/0/index/_5f9v_k.del")))]];
 
]]
[2014-12-17 11:10:12,258][WARN ][indices.cluster          ] [search02] 
[201205][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: 
[201205][0] failed to fetch index version after copying it over
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
    at 
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.lucene.index.CorruptIndexException: [201205][0] 
Corrupted index [corrupted_xCs6wOMpR-G3pbQfUpn-Ww] caused by: 
CorruptIndexException[did not read all bytes from file: read 205 vs size 
206 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201205/0/index/_1ys_3.del")))]
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:353)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:338)
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)
    ... 4 more
[2014-12-17 11:10:12,278][WARN ][indices.cluster          ] [search02] 
[201304][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: 
[201304][4] failed to fetch index version after copying it over
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
    at 
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.lucene.index.CorruptIndexException: [201304][4] 
Corrupted index [corrupted_mfMa6wjdT1m6QZ6WUBHKrA] caused by: 
CorruptIndexException[did not read all bytes from file: read 295641 vs size 
295642 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201304/4/index/_294h_17.del")))]
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:353)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:338)
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)
    ... 4 more
[2014-12-17 11:10:12,282][WARN ][cluster.action.shard     ] [search02] 
[201205][0] sending failed shard for [201205][0], 
node[OWUJ3lZbT5i00JKgrDFUcw], [P], s[INITIALIZING], indexUUID [_na_], 
reason [Failed to start shard, message 
[IndexShardGatewayRecoveryException[[201205][0] failed to fetch index 
version after copying it over]; nested: CorruptIndexException[[201205][0] 
Corrupted index [corrupted_xCs6wOMpR-G3pbQfUpn-Ww] caused by: 
CorruptIndexException[did not read all bytes from file: read 205 vs size 
206 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201205/0/index/_1ys_3.del")))]];
 
]]
[2014-12-17 11:10:12,297][WARN ][cluster.action.shard     ] [search02] 
[201304][4] sending failed shard for [201304][4], 
node[OWUJ3lZbT5i00JKgrDFUcw], [P], s[INITIALIZING], indexUUID [_na_], 
reason [Failed to start shard, message 
[IndexShardGatewayRecoveryException[[201304][4] failed to fetch index 
version after copying it over]; nested: CorruptIndexException[[201304][4] 
Corrupted index [corrupted_mfMa6wjdT1m6QZ6WUBHKrA] caused by: 
CorruptIndexException[did not read all bytes from file: read 295641 vs size 
295642 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201304/4/index/_294h_17.del")))]];
 
]]

ES log from node 3 (search03):

[2014-12-17 11:13:49,541][WARN ][cluster.action.shard     ] [search03] 
[201205][0] sending failed shard for [201205][0], 
node[zygoKW7SR6CwvanVoNrPcw], [P], s[INITIALIZING], indexUUID [_na_], 
reason [Failed to start shard, message 
[IndexShardGatewayRecoveryException[[201205][0] failed to fetch index 
version after copying it over]; nested: CorruptIndexException[[201205][0] 
Corrupted index [corrupted_weSqXhW_T9Wle8wEHhEnXw] caused by: 
CorruptIndexException[did not read all bytes from file: read 205 vs size 
206 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201205/0/index/_1ys_3.del")))]];
 
]]
[2014-12-17 11:13:49,581][WARN ][indices.cluster          ] [search03] 
[201304][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: 
[201304][4] failed to fetch index version after copying it over
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
    at 
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.lucene.index.CorruptIndexException: [201304][4] 
Corrupted index [corrupted_7hrGiX_jTx2KLbQUIAiLpg] caused by: 
CorruptIndexException[did not read all bytes from file: read 295641 vs size 
295642 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201304/4/index/_294h_17.del")))]
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:353)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:338)
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)
    ... 4 more
[2014-12-17 11:13:49,651][WARN ][cluster.action.shard     ] [search03] 
[201304][4] sending failed shard for [201304][4], 
node[zygoKW7SR6CwvanVoNrPcw], [P], s[INITIALIZING], indexUUID [_na_], 
reason [Failed to start shard, message 
[IndexShardGatewayRecoveryException[[201304][4] failed to fetch index 
version after copying it over]; nested: CorruptIndexException[[201304][4] 
Corrupted index [corrupted_7hrGiX_jTx2KLbQUIAiLpg] caused by: 
CorruptIndexException[did not read all bytes from file: read 295641 vs size 
295642 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201304/4/index/_294h_17.del")))]];
 
]]
[2014-12-17 11:13:49,747][WARN ][indices.cluster          ] [search03] 
[201205][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: 
[201205][0] failed to fetch index version after copying it over
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
    at 
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.lucene.index.CorruptIndexException: [201205][0] 
Corrupted index [corrupted_weSqXhW_T9Wle8wEHhEnXw] caused by: 
CorruptIndexException[did not read all bytes from file: read 205 vs size 
206 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201205/0/index/_1ys_3.del")))]
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:353)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:338)
    at 
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)
    ... 4 more
[2014-12-17 11:13:49,823][WARN ][cluster.action.shard     ] [search03] 
[201205][0] sending failed shard for [201205][0], 
node[zygoKW7SR6CwvanVoNrPcw], [P], s[INITIALIZING], indexUUID [_na_], 
reason [Failed to start shard, message 
[IndexShardGatewayRecoveryException[[201205][0] failed to fetch index 
version after copying it over]; nested: CorruptIndexException[[201205][0] 
Corrupted index [corrupted_weSqXhW_T9Wle8wEHhEnXw] caused by: 
CorruptIndexException[did not read all bytes from file: read 205 vs size 
206 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201205/0/index/_1ys_3.del")))]];
 
]]
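
The excerpts above repeat the same few failures many times. In every case the file is a .del file and is reported as exactly one byte shorter than its recorded size. A rough sketch, assuming Python 3, of how the distinct files could be pulled out of a log; the regular expression and the function name are my own and only cover the message format shown in these excerpts:

```python
import re

# Matches the (possibly line-wrapped) detail of the CorruptIndexException:
#   read <n> vs size <m> (resource: ...(path="<file>")
PATTERN = re.compile(
    r'read\s+(\d+)\s+vs\s+size\s+(\d+)\s+\(resource:\s+.*?path="([^"]+)"',
    re.DOTALL,
)


def corrupted_files(log_text):
    """Return {path: (bytes_read, recorded_size)} for each distinct file."""
    found = {}
    for read, size, path in PATTERN.findall(log_text):
        found[path] = (int(read), int(size))
    return found


# One message copied from the search01 log above, including its line wraps.
sample = '''CorruptIndexException[did not read all bytes from file: read 9650 vs size 
9651 (resource: 
BufferedChecksumIndexInput(NIOFSIndexInput(path="/var/lib/elasticsearch/searchcass/nodes/0/indices/201301/0/index/_5f9v_k.del")))]'''

for path, (read, size) in sorted(corrupted_files(sample).items()):
    print(path, "short by", size - read, "byte(s)")
```

Running it over the full log of each node should reduce the output to one line per affected file.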
