[ https://issues.apache.org/jira/browse/OAK-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629197#comment-14629197 ]
Chetan Mehrotra edited comment on OAK-3110 at 7/16/15 5:05 AM: --------------------------------------------------------------- Attaching [logs|^copier.log] from IndexCopier and AsyncIndexUpdate for a single run where the run failed due to exception in CopyOnWrite. The issue here has its root in OAK-2855 h3. Interaction between CopyOnWrite (COW) and CopyOnRead (COR) COW logic is implemented with the requirement that it has to *share* the same local directory as used in COR. This ensures that it can make use of already copied index files while indexing. Any indexing sequence would work in following sequence # COW - Index done - Index moved to R1 # COR - Open at R1 # COW - Index moves to R2 # COR - Picks up index at R2 # COW - Starts indexing for R3 Further COR takes a snapshot of current local directory content and when it closes it deletes those files which are present in local directory but not present in remote directory. This ensures that issues seen in OAK-2855 do not happen. This approach would have worked fine if above sequence is strictly sequential. However in real world thats not the case as the 2 tasks are performed via different threads. For e.g. say at #3 when COW is done with indexing the Lucene index update observor would detect that index has updated. This observation processing would happen in a different thread. Now when at #4 COR is opening the index it can happen that COW has also started with its index at #5 and hence started creating new files. These new files would be picked by COR at #4 during its opening phase and it would consider them part of old index. Now when COR would close it would delete these files and that would then result in a FileNotFoundException for COW in #5 h3. Mapping to Logs In the attached logs following timeline is seen {noformat} 14:20:01.700 *DEBUG* [pool-58-thread-1] AsyncIndexUpdate Running background index task async 14:20:08.601 *DEBUG* [pool-58-thread-1] IndexCopier [COW] Creating output _2s7.fdt 14:20:18.146 *DEBUG* [oak-lucene-0] IndexCopier [/oak:index/foxDamIndex] Following files have been removed from Lucene index directory [_2s7.fdt, _2s7.fdx] 14:20:35.752 *DEBUG* [oak-lucene-0] IndexCopier Error occurred while copying files. Further processing would be skipped java.io.FileNotFoundException: _2s7.fdt at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261) 14:20:35.930 *WARN* [pool-58-thread-1] AsyncIndexUpdate The async index update failed {noformat} So as can be seen that COW starts at 20:01 and creates _2s7.fdt at 20:08. However that file got deleted by oak-lucene-0 (thread which opens index with COR) at 20:18 and then eventually FileNotFoundException is thrown when copy logic in COW tries to read the file h3. Fix As a fix COW and COR would have to share the current working set and COR should not touch files in that working set was (Author: chetanm): Attaching [logs|^copier.log] from IndexCopier and AsyncIndexUpdate for a single run where the run failed due to exception in CopyOnWrite. The issue here has its root in OAK-2855 h3. Interaction between CopyOnWrite (COW) and CopyOnRead (COR) COW logic is implemented with the requirement that it has sure the same local directory as used in COR. This ensures that it can make use of alread copied index files while indexing. Any indexing sequence would work in following sequence # COW - Index done - Index moved to R1 # COR - Open at R1 # COW - Index moves to R2 # COR - Picks up index at R2 # COW - Starts indexing for R3 Further COR takes a snapshot of current local directory content and when it closes it deletes those files which are present in local directory but not present in remote directory. This ensures that issues seen in OAK-2855 do not happen. This approach would have worked fine if above sequence is strictly sequential. However in real world thats not the case as the 2 task are performed via different threads. For e.g. say at #3 when COW is done with indexing the Lucene index update observor would detect that index has updated. This observation processing would happen in a different thread. Now when at #4 COR is opening the index it can happen that COW has also started with its index at #5 and hence started creating new files. These new files would be picked by COR at #4 during its opening phase and it would consider them part of old index. Now when COR would close it would delete these files and that would then result in a FileNotFoundException for COW in #5 h3. Mapping to Logs In the attached logs following timeline is seen {noformat} 14:20:01.700 *DEBUG* [pool-58-thread-1] AsyncIndexUpdate Running background index task async 14:20:08.601 *DEBUG* [pool-58-thread-1] IndexCopier [COW] Creating output _2s7.fdt 14:20:18.146 *DEBUG* [oak-lucene-0] IndexCopier [/oak:index/foxDamIndex] Following files have been removed from Lucene index directory [_2s7.fdt, _2s7.fdx] 14:20:35.752 *DEBUG* [oak-lucene-0] IndexCopier Error occurred while copying files. Further processing would be skipped java.io.FileNotFoundException: _2s7.fdt at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261) 14:20:35.930 *WARN* [pool-58-thread-1] AsyncIndexUpdate The async index update failed {noformat} So as can be seen that COW starts at 20:01 and creates _2s7.fdt at 20:08. However that file got deleted by oak-lucene-0 (thread which opens index with COR) at 20:18 and then eventually FileNotFoundException is thrown when copy logic in COW tries to read the file h3. Fix As a fix COW and COR would have to share the current working set and COR should not touch files in that working set > AsyncIndexer fails due to FileNotFoundException thrown by CopyOnWrite logic > --------------------------------------------------------------------------- > > Key: OAK-3110 > URL: https://issues.apache.org/jira/browse/OAK-3110 > Project: Jackrabbit Oak > Issue Type: Bug > Components: lucene > Reporter: Chetan Mehrotra > Assignee: Chetan Mehrotra > Fix For: 1.2.3, 1.3.3, 1.0.18 > > Attachments: copier.log > > > At times the CopyOnWrite reports following exception > {noformat} > 15.07.2015 14:20:35.930 *WARN* [pool-58-thread-1] > org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate The async index > update failed > org.apache.jackrabbit.oak.api.CommitFailedException: OakLucene0004: Failed to > close the Lucene index > at > org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:204) > at > org.apache.jackrabbit.oak.plugins.index.IndexUpdate.leave(IndexUpdate.java:219) > at > org.apache.jackrabbit.oak.spi.commit.VisibleEditor.leave(VisibleEditor.java:63) > at > org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:56) > at > org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.updateIndex(AsyncIndexUpdate.java:366) > at > org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:311) > at > org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105) > at org.quartz.core.JobRunShell.run(JobRunShell.java:207) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.FileNotFoundException: _2s7.fdt > at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261) > at > org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory$COWLocalFileReference.fileLength(IndexCopier.java:837) > at > org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory.fileLength(IndexCopier.java:607) > at > org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:141) > at > org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:529) > at > org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:502) > at > org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:508) > at > org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:618) > at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3147) > at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3123) > at > org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:988) > at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:932) > at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:894) > at > org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.closeWriter(LuceneIndexEditorContext.java:192) > at > org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:202) > ... 10 common frames omitted > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)