[ https://issues.apache.org/jira/browse/OAK-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chetan Mehrotra updated OAK-4114: --------------------------------- Labels: resilience (was: ) > Cached lucene index gets corrupted in case of unclean shutdown and journal > rollback in SegmentNodeStore > ------------------------------------------------------------------------------------------------------- > > Key: OAK-4114 > URL: https://issues.apache.org/jira/browse/OAK-4114 > Project: Jackrabbit Oak > Issue Type: Bug > Components: lucene > Reporter: Chetan Mehrotra > Assignee: Chetan Mehrotra > Priority: Minor > Labels: resilience > Fix For: 1.6 > > > Currently Oak Lucene support would copy index files to local file system as > part of CopyOnRead feature. In one of the setup it has been observed that > index logic was failing with following error > {noformat} > 04.02.2016 17:47:52.391 *WARN* [oak-lucene-3] > org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier > [/oak:index/lucene] Found local copy for _2ala.cfs in > MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 > > lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 > but size of local 9320 differs from remote 3714150. Content would be read > from remote file only > 04.02.2016 17:47:52.399 *WARN* [oak-lucene-3] > org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier > [/oak:index/lucene] Found local copy for segments_28je in > MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 > > lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 > but size of local 1214 differs from remote 1175. Content would be read from > remote file only > 04.02.2016 17:47:52.491 *ERROR* [oak-lucene-3] > org.apache.jackrabbit.oak.plugins.index.lucene.IndexTracker Failed to open > Lucene index at /oak:index/lucene > org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual > header=1953790076 vs expected header=1071082519 (resource: > SlicedIndexInput(SlicedIndexInput(_2ala.fnm in _2ala.cfs) in _2ala.cfs > slice=8810:9320)) > at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:128) > at > org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56) > at > org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:215) > {noformat} > Here size of __2ala.cfs_ differed from remote copy and possible other index > file may have same size but different content. Comparing the modified time of > the files with those in Oak it can be seen that one of file system was older > than one in Oak > {noformat} > _2alr.cfs={name=_2alr.cfs, size=1152402, sizeStr=1.2 MB, modified=Thu Feb 04 > 17:52:31 GMT 2016, osModified=Feb 4 17:52, osSize=1152402, mismatch=false} > _2ala.cfe={name=_2ala.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 > 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=224, mismatch=false} > _2ala.si={name=_2ala.si, size=252, sizeStr=252 B, modified=Thu Feb 04 > 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=252, mismatch=false} > _2ala.cfs={name=_2ala.cfs, size=3714150, sizeStr=3.7 MB, modified=Thu Feb 04 > 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=9320, mismatch=true} > _14u3_29.del={name=_14u3_29.del, size=1244036, sizeStr=1.2 MB, modified=Thu > Feb 04 16:37:35 GMT 2016, osModified=Feb 4 16:37, osSize=1244036, > mismatch=false} > _2akw.si={name=_2akw.si, size=252, sizeStr=252 B, modified=Thu Feb 04 > 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=252, mismatch=false} > _2akw.cfe={name=_2akw.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 > 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=224, mismatch=false} > _2akw.cfs={name=_2akw.cfs, size=4952761, sizeStr=5.0 MB, modified=Thu Feb 04 > 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=4952761, mismatch=false} > {noformat} > And on same setup the system did saw a rollback in segment node store > {noformat} > -rw-rw-r--. 1 crx crx 25961984 Feb 4 16:47 data01357a.tar > -rw-rw-r--. 1 crx crx 24385536 Feb 4 16:41 data01357a.tar.bak > -rw-rw-r--. 1 crx crx 359936 Feb 4 17:18 data01358a.tar > -rw-rw-r--. 1 crx crx 345088 Feb 4 17:17 data01358a.tar.bak > -rw-rw-r--. 1 crx crx 70582272 Feb 4 18:35 data01359a.tar > -rw-rw-r--. 1 crx crx 66359296 Feb 4 18:33 data01359a.tar.bak > -rw-rw-r--. 1 crx crx 282112 Feb 4 18:46 data01360a.tar > -rw-rw-r--. 1 crx crx 236544 Feb 4 18:45 data01360a.tar.bak > -rw-rw-r--. 1 crx crx 138240 Feb 4 18:56 data01361a.tar > {noformat} > So one possible cause is that > # At some time earlier to 17:17 lucene index got updated and __2ala.cfs_ got > created. > # Post update the head revision in Segment store was updated but the revision > yet to made it to journal log > # Lucene CopyOnRead logic got event for the change and copied the file > # System crashed and hence journal did not got updated > # System restarted and per last entry in journal system suffered with some > "data loss" and hence index checkpoint also moved back > # As checkpoint got reverted index started at earlier state and hence created > a file with same name __2ala.cfs_ > # CopyOnRead detected file length change and logged a warning routing call to > remote > # However other files like _2ala.si, _2ala.cfe which were created in same > commit had same size but likely different content which later cause lucene > query to start failing > In such a case a restart after cleaning the existing index content would have > brought back the system to normal state. > So as a fix we would need to come up with some sanity check at time of system > startup -- This message was sent by Atlassian JIRA (v6.3.4#6332)