+1 for a less invasive way to recover data

I had a similar issue today on one of our test servers, where I eventually 
managed to recover my index by running CheckIndex on one of my shards. In 
my case, I also had to remove the translog recovery file to actually get 
the cluster green. That step seems to be omitted from most mentions of the 
CheckIndex tool in combination with Elasticsearch.
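
For reference, this is roughly what I did; treat it as a sketch rather 
than a recipe. The paths and the Lucene jar version are from my setup (ES 
1.3.x ships lucene-core 4.9.1 in its lib directory), and the exact name of 
the translog recovery file is something you will have to look up in the 
shard's translog directory yourself. Stop the node and back the shard up 
first:

    # back up the whole shard before touching anything
    cp -a /data/cluster1/nodes/0/indices/order/3 /backup/order-3

    # read-only CheckIndex pass against the shard's Lucene index
    java -cp /usr/share/elasticsearch/lib/lucene-core-4.9.1.jar \
        org.apache.lucene.index.CheckIndex \
        /data/cluster1/nodes/0/indices/order/3/index

    # if it reports corruption, -fix removes the references to the broken
    # segments; the documents in those segments are lost, hence the backup
    java -cp /usr/share/elasticsearch/lib/lucene-core-4.9.1.jar \
        org.apache.lucene.index.CheckIndex \
        /data/cluster1/nodes/0/indices/order/3/index -fix

    # in my case the cluster only went green after I also removed the
    # leftover recovery file under the shard's translog directory
    # (the filename pattern below is a guess; check what's actually there)
    rm /data/cluster1/nodes/0/indices/order/3/translog/*.recovering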

Anyway, after this, I ran CheckIndex on some other shards that were 
supposedly fine and was a bit surprised when it actually reported and fixed 
some errors there too.
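
For the curious, that was just a pass over every shard directory of the 
index on that node (same assumptions as above: node stopped, paths from my 
setup), adding -fix wherever a shard reported problems:

    for shard in /data/cluster1/nodes/0/indices/order/*/index; do
        echo "== $shard =="
        java -cp /usr/share/elasticsearch/lib/lucene-core-4.9.1.jar \
            org.apache.lucene.index.CheckIndex "$shard"
    done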

This makes me wonder if there should be an API around this tool in 
Elasticsearch that lets you run corruption checks on the whole cluster and 
fix problems. It would be nice to be able to run some diagnostics to 
confirm your data is actually 100% OK. I know Elasticsearch has been 
adding more checks that run on startup, involving checksums and so on, but 
those checks apparently failed to detect problems that CheckIndex thinks 
need fixing. That sounds like something most admins would like to know 
about their cluster.
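
The closest thing I'm aware of is the index.shard.check_on_startup 
setting, which (if I read the documentation right) runs a CheckIndex-style 
verification whenever a shard starts up: "checksum" verifies checksums, 
"true" does a full check, and "fix" is the equivalent of CheckIndex -fix. 
Something like the following, assuming the setting is available on your 
version and that the index has to be closed to change it:

    curl -XPOST 'localhost:9200/order/_close'
    curl -XPUT 'localhost:9200/order/_settings' -d '{
      "index.shard.check_on_startup": "checksum"
    }'
    curl -XPOST 'localhost:9200/order/_open'

That's per index and only on startup, though, not the on-demand 
cluster-wide diagnostic I'd actually want.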

On Thursday, February 12, 2015 at 10:44:26 AM UTC+1, Philipp Knobel wrote:
>
> Hi all,
>
> we recently had an issue where ES reported a file corruption (more 
> specifically, a read past EOF error) after imports and deletions over a 
> longer timeframe. ES reported long garbage collection times on a few 
> nodes, but then was silent again until it started to show the EOF 
> exception. From what I could find on the internet, this kind of exception 
> can happen if an OutOfMemory error occurs or no disk space is left. 
> Neither happened in our scenario, so I don't understand how this could 
> occur in the first place. We're running ES 1.3.4 and migrated a while ago 
> from 0.20.
>
> [2015-02-06 01:15:11.971 GMT] INFO |||||| elasticsearch[3-6][scheduler][T#1] org.elasticsearch.monitor.jvm  [3-6] [gc][young][618719][105280] duration [962ms], collections [1]/[1.6s], total [962ms]/[16.8m], memory [435.2mb]->[425.9mb]/[1.9gb], all_pools {[young] [28.2mb]->[5.3mb]/[546.1mb]}{[survivor] [6.3mb]->[6.3mb]/[68.2mb]}{[old] [400.5mb]->[414.2mb]/[1.3gb]}
> [2015-02-06 07:20:44.188 GMT] WARN |||||| elasticsearch[3-6][[order][3]: Lucene Merge Thread #17] org.elasticsearch.index.merge.scheduler  [3-6] [order][3] failed to merge
> java.io.EOFException: read past EOF: NIOFSIndexInput(path="/data/cluster1/nodes/0/indices/order/3/index/_dr3z.fdt")
>   at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:144)
>   at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:116)
>   at org.apache.lucene.codecs.lucene3x.Lucene3xStoredFieldsReader.readField(Lucene3xStoredFieldsReader.java:273)
>   at org.apache.lucene.codecs.lucene3x.Lucene3xStoredFieldsReader.visitDocument(Lucene3xStoredFieldsReader.java:240)
>   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:341)
>   at org.apache.lucene.index.FilterAtomicReader.document(FilterAtomicReader.java:389)
>   at org.apache.lucene.index.IndexReader.document(IndexReader.java:460)
>   at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:355)
>   at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:332)
>   at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
>   at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4225)
>   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3820)
>   at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>   at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:106)
>   at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> We ran CheckIndex and it reported a read past EOF exception for this .fdt 
> file and the corresponding .tis file.
>
>   2 of 29: name=_dr3z docCount=575018
>     codec=Lucene3x
>     compound=false
>     numFiles=11
>     size (MB)=512.496
>     diagnostics = {os=Linux, os.version=3.1.6, mergeFactor=10, source=merge, lucene.version=3.6.2 1423725 - rmuir - 2012-12-18 19:45:40, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.7.0_51, java.vendor=Oracle Corporation}
>     has deletions [delGen=422]
>     test: open reader.........OK
>     test: check integrity.....OK
>     test: check live docs.....OK [419388 deleted docs]
>     test: fields..............OK [132 fields]
>     test: field norms.........OK [48 fields]
>     test: terms, freq, prox...ERROR: java.io.EOFException: seek past EOF: MMapIndexInput(path="/data/cluster1/nodes/0/indices/order/3/index/_dr3z.tis")
> java.io.EOFException: seek past EOF: MMapIndexInput(path="/data/cluster1/nodes/0/indices/order/3/index/_dr3z.tis")
>   at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek(ByteBufferIndexInput.java:431)
>   at org.apache.lucene.codecs.lucene3x.SegmentTermEnum.seek(SegmentTermEnum.java:127)
>   at org.apache.lucene.codecs.lucene3x.TermInfosReaderIndex.seekEnum(TermInfosReaderIndex.java:153)
>   at org.apache.lucene.codecs.lucene3x.TermInfosReader.seekEnum(TermInfosReader.java:287)
>   at org.apache.lucene.codecs.lucene3x.TermInfosReader.seekEnum(TermInfosReader.java:232)
>   at org.apache.lucene.codecs.lucene3x.Lucene3xFields$PreTermsEnum.seekCeil(Lucene3xFields.java:750)
>   at org.apache.lucene.index.Terms.getMax(Terms.java:182)
>   at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:795)
>   at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1325)
>   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:631)
>   at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2051)
>     test: stored fields.......ERROR [read past EOF: MMapIndexInput(path="/data/cluster1/nodes/0/indices/order/3/index/_dr3z.fdt")]
> java.io.EOFException: read past EOF: MMapIndexInput(path="/data/cluster1/nodes/0/indices/order/3/index/_dr3z.fdt")
>   at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:104)
>   at org.apache.lucene.codecs.lucene3x.Lucene3xStoredFieldsReader.readField(Lucene3xStoredFieldsReader.java:273)
>   at org.apache.lucene.codecs.lucene3x.Lucene3xStoredFieldsReader.visitDocument(Lucene3xStoredFieldsReader.java:240)
>   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:341)
>   at org.apache.lucene.index.IndexReader.document(IndexReader.java:460)
>   at org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:1361)
>   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:634)
>   at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2051)
>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>     test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
> FAILED
>     WARNING: fixIndex() would remove reference to this segment; full exception:
> java.lang.RuntimeException: Term Index test failed
>   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:646)
>   at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2051)
>
> One strange thing is that this segment is the only one still on Lucene 
> 3.6.2, while all the others are on 4.9.1. The .tis file was reported as 
> missing only once in our logs, and that was some "long" time after the 
> complaints about the .fdt file started.
>
> [2015-02-06 10:31:56.060] WARN elasticsearch[blade5-2][clusterService#updateTask][T#1] org.elasticsearch.index.store [5-2] [order][3] Can't open file to read checksums
> java.io.FileNotFoundException: No such file [_dr3z.tis]
>   at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:176)
>   at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:144)
>   at org.elasticsearch.index.store.DistributorDirectory.fileLength(DistributorDirectory.java:113)
>   at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:482)
>   at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:456)
>   at org.elasticsearch.index.store.Store.getMetadata(Store.java:154)
>   at org.elasticsearch.index.store.Store.getMetadata(Store.java:143)
>   at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:728)
>   at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:580)
>   at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:184)
>   at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:444)
>   at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
>
> We fixed this issue by shutting down the cluster and running CheckIndex 
> on the affected nodes, but I would like to know if there is a less 
> invasive way to do this, should the issue happen again?
>
