[ 
https://issues.apache.org/jira/browse/OAK-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614020#comment-15614020
 ] 

Abhishek Sharma edited comment on OAK-4114 at 10/28/16 2:14 AM:
----------------------------------------------------------------

Is there any update on this issue.
The lucene index for AEM 6.1 Oak repository is corrupted on restart at times 
which takes significant time to build.


was (Author: aurorabhi):
Is there any update on this issue.
The lucene index for AEM 6.1 Oak repository is corrupted on restart at times 
which takes gignificant time to build.

> Cached lucene index gets corrupted in case of unclean shutdown and journal 
> rollback in SegmentNodeStore
> -------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-4114
>                 URL: https://issues.apache.org/jira/browse/OAK-4114
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>            Priority: Minor
>              Labels: resilience
>             Fix For: 1.6
>
>
> Currently Oak Lucene support would copy index files to local file system as 
> part of CopyOnRead feature. In one of the setup it has been observed that 
> index logic was failing with following error
> {noformat}
> 04.02.2016 17:47:52.391 *WARN* [oak-lucene-3] 
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier 
> [/oak:index/lucene] Found local copy for _2ala.cfs in 
> MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>  
> lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>  but size of local 9320 differs from remote 3714150. Content would be read 
> from remote file only
> 04.02.2016 17:47:52.399 *WARN* [oak-lucene-3] 
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier 
> [/oak:index/lucene] Found local copy for segments_28je in 
> MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>  
> lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1
>  but size of local 1214 differs from remote 1175. Content would be read from 
> remote file only
> 04.02.2016 17:47:52.491 *ERROR* [oak-lucene-3] 
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexTracker Failed to open 
> Lucene index at /oak:index/lucene
> org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual 
> header=1953790076 vs expected header=1071082519 (resource: 
> SlicedIndexInput(SlicedIndexInput(_2ala.fnm in _2ala.cfs) in _2ala.cfs 
> slice=8810:9320))
>       at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:128)
>       at 
> org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
>       at 
> org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:215)
> {noformat}
> Here size of __2ala.cfs_ differed from remote copy and possible other index 
> file may have same size but different content. Comparing the modified time of 
> the files with those in Oak it can be seen that one of file system was older 
> than one in Oak
> {noformat}
> _2alr.cfs={name=_2alr.cfs, size=1152402, sizeStr=1.2 MB, modified=Thu Feb 04 
> 17:52:31 GMT 2016, osModified=Feb 4 17:52, osSize=1152402, mismatch=false}
> _2ala.cfe={name=_2ala.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 
> 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=224, mismatch=false}
> _2ala.si={name=_2ala.si, size=252, sizeStr=252 B, modified=Thu Feb 04 
> 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=252, mismatch=false}
> _2ala.cfs={name=_2ala.cfs, size=3714150, sizeStr=3.7 MB, modified=Thu Feb 04 
> 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=9320, mismatch=true}
> _14u3_29.del={name=_14u3_29.del, size=1244036, sizeStr=1.2 MB, modified=Thu 
> Feb 04 16:37:35 GMT 2016, osModified=Feb 4 16:37, osSize=1244036, 
> mismatch=false}
> _2akw.si={name=_2akw.si, size=252, sizeStr=252 B, modified=Thu Feb 04 
> 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=252, mismatch=false}
> _2akw.cfe={name=_2akw.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 
> 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=224, mismatch=false}
> _2akw.cfs={name=_2akw.cfs, size=4952761, sizeStr=5.0 MB, modified=Thu Feb 04 
> 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=4952761, mismatch=false}
> {noformat}
> And on same setup the system did saw a rollback in segment node store 
> {noformat}
> -rw-rw-r--. 1 crx crx  25961984 Feb  4 16:47 data01357a.tar
> -rw-rw-r--. 1 crx crx  24385536 Feb  4 16:41 data01357a.tar.bak
> -rw-rw-r--. 1 crx crx    359936 Feb  4 17:18 data01358a.tar
> -rw-rw-r--. 1 crx crx    345088 Feb  4 17:17 data01358a.tar.bak
> -rw-rw-r--. 1 crx crx  70582272 Feb  4 18:35 data01359a.tar
> -rw-rw-r--. 1 crx crx  66359296 Feb  4 18:33 data01359a.tar.bak
> -rw-rw-r--. 1 crx crx    282112 Feb  4 18:46 data01360a.tar
> -rw-rw-r--. 1 crx crx    236544 Feb  4 18:45 data01360a.tar.bak
> -rw-rw-r--. 1 crx crx    138240 Feb  4 18:56 data01361a.tar
> {noformat}
> So one possible cause is that 
> # At some time earlier to 17:17 lucene index got updated and __2ala.cfs_ got 
> created. 
> # Post update the head revision in Segment store was updated but the revision 
> yet to made it to journal log
> # Lucene CopyOnRead logic got event for the change and copied the file
> # System crashed and hence journal did not got updated
> # System restarted and per last entry in journal system suffered with some 
> "data loss" and hence index checkpoint also moved back
> # As checkpoint got reverted index started at earlier state and hence created 
> a file with same name __2ala.cfs_ 
> # CopyOnRead detected file length change and logged a warning routing call to 
> remote
> # However other files like _2ala.si, _2ala.cfe which were created in same 
> commit had same size but likely different content which later cause lucene 
> query to start failing
> In such a case a restart after cleaning the existing index content would have 
> brought back the system to normal state.
> So as a fix we would need to come up with some sanity check at time of system 
> startup



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to