[ 
https://issues.apache.org/jira/browse/OAK-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629197#comment-14629197
 ] 

Chetan Mehrotra edited comment on OAK-3110 at 7/16/15 5:05 AM:
---------------------------------------------------------------

Attaching [logs|^copier.log] from IndexCopier and AsyncIndexUpdate for a single 
run where the run failed due to exception in CopyOnWrite. The issue here has 
its root in OAK-2855

h3. Interaction between CopyOnWrite (COW) and CopyOnRead (COR)

COW logic is implemented with the requirement that it has to *share* the same 
local directory as used in COR. This ensures that it can make use of already 
copied index files while indexing. Any indexing sequence would work in 
following sequence

# COW - Index done - Index moved to R1
# COR - Open at R1
# COW - Index moves to R2
# COR - Picks up index at R2
# COW - Starts indexing for R3

Further COR takes a snapshot of current local directory content and when it 
closes it deletes those files which are present in local directory but not 
present in remote directory. This ensures that issues seen in OAK-2855 do not 
happen. This approach would have worked fine if above sequence is strictly 
sequential. However in real world thats not the case as the 2 tasks are 
performed via different threads. 

For e.g. say at #3 when COW is done with indexing the Lucene index update 
observor would detect that index has updated. This observation processing would 
happen in a different thread. Now when at #4 COR is opening the index it can 
happen that COW has also started with its index at #5 and hence started 
creating new files. These new files would be picked by COR at #4 during its 
opening phase and it would consider them part of old index. Now when COR would 
close it would delete these files and that would then result in a 
FileNotFoundException for COW in #5

h3. Mapping to Logs

In the attached logs following timeline is seen

{noformat}
14:20:01.700 *DEBUG* [pool-58-thread-1] AsyncIndexUpdate Running background 
index task async
14:20:08.601 *DEBUG* [pool-58-thread-1] IndexCopier [COW] Creating output 
_2s7.fdt
14:20:18.146 *DEBUG* [oak-lucene-0] IndexCopier [/oak:index/foxDamIndex] 
Following files have been removed from Lucene index directory [_2s7.fdt, 
_2s7.fdx]
14:20:35.752 *DEBUG* [oak-lucene-0] IndexCopier Error occurred while copying 
files. Further processing would be skipped
java.io.FileNotFoundException: _2s7.fdt
        at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
14:20:35.930 *WARN* [pool-58-thread-1] AsyncIndexUpdate The async index update 
failed
{noformat}

So as can be seen that COW starts at 20:01 and creates _2s7.fdt at 20:08. 
However that file got deleted by oak-lucene-0 (thread which opens index with 
COR) at 20:18 and then eventually FileNotFoundException is thrown when copy 
logic in COW tries to read the file

h3. Fix

As a fix COW and COR would have to share the current working set and COR should 
not touch files in that working set


was (Author: chetanm):
Attaching [logs|^copier.log] from IndexCopier and AsyncIndexUpdate for a single 
run where the run failed due to exception in CopyOnWrite. The issue here has 
its root in OAK-2855

h3. Interaction between CopyOnWrite (COW) and CopyOnRead (COR)

COW logic is implemented with the requirement that it has sure the same local 
directory as used in COR. This ensures that it can make use of alread copied 
index files while indexing. Any indexing sequence would work in following 
sequence

# COW - Index done - Index moved to R1
# COR - Open at R1
# COW - Index moves to R2
# COR - Picks up index at R2
# COW - Starts indexing for R3

Further COR takes a snapshot of current local directory content and when it 
closes it deletes those files which are present in local directory but not 
present in remote directory. This ensures that issues seen in OAK-2855 do not 
happen. This approach would have worked fine if above sequence is strictly 
sequential. However in real world thats not the case as the 2 task are 
performed via different threads. 

For e.g. say at #3 when COW is done with indexing the Lucene index update 
observor would detect that index has updated. This observation processing would 
happen in a different thread. Now when at #4 COR is opening the index it can 
happen that COW has also started with its index at #5 and hence started 
creating new files. These new files would be picked by COR at #4 during its 
opening phase and it would consider them part of old index. Now when COR would 
close it would delete these files and that would then result in a 
FileNotFoundException for COW in #5

h3. Mapping to Logs

In the attached logs following timeline is seen

{noformat}
14:20:01.700 *DEBUG* [pool-58-thread-1] AsyncIndexUpdate Running background 
index task async
14:20:08.601 *DEBUG* [pool-58-thread-1] IndexCopier [COW] Creating output 
_2s7.fdt
14:20:18.146 *DEBUG* [oak-lucene-0] IndexCopier [/oak:index/foxDamIndex] 
Following files have been removed from Lucene index directory [_2s7.fdt, 
_2s7.fdx]
14:20:35.752 *DEBUG* [oak-lucene-0] IndexCopier Error occurred while copying 
files. Further processing would be skipped
java.io.FileNotFoundException: _2s7.fdt
        at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
14:20:35.930 *WARN* [pool-58-thread-1] AsyncIndexUpdate The async index update 
failed
{noformat}

So as can be seen that COW starts at 20:01 and creates _2s7.fdt at 20:08. 
However that file got deleted by oak-lucene-0 (thread which opens index with 
COR) at 20:18 and then eventually FileNotFoundException is thrown when copy 
logic in COW tries to read the file

h3. Fix

As a fix COW and COR would have to share the current working set and COR should 
not touch files in that working set

> AsyncIndexer fails due to FileNotFoundException thrown by CopyOnWrite logic
> ---------------------------------------------------------------------------
>
>                 Key: OAK-3110
>                 URL: https://issues.apache.org/jira/browse/OAK-3110
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.2.3, 1.3.3, 1.0.18
>
>         Attachments: copier.log
>
>
> At times the CopyOnWrite reports following exception
> {noformat}
> 15.07.2015 14:20:35.930 *WARN* [pool-58-thread-1] 
> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate The async index 
> update failed
> org.apache.jackrabbit.oak.api.CommitFailedException: OakLucene0004: Failed to 
> close the Lucene index
>       at 
> org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:204)
>       at 
> org.apache.jackrabbit.oak.plugins.index.IndexUpdate.leave(IndexUpdate.java:219)
>       at 
> org.apache.jackrabbit.oak.spi.commit.VisibleEditor.leave(VisibleEditor.java:63)
>       at 
> org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:56)
>       at 
> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.updateIndex(AsyncIndexUpdate.java:366)
>       at 
> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:311)
>       at 
> org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
>       at org.quartz.core.JobRunShell.run(JobRunShell.java:207)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: _2s7.fdt
>       at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
>       at 
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory$COWLocalFileReference.fileLength(IndexCopier.java:837)
>       at 
> org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory.fileLength(IndexCopier.java:607)
>       at 
> org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:141)
>       at 
> org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:529)
>       at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:502)
>       at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:508)
>       at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:618)
>       at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3147)
>       at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3123)
>       at 
> org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:988)
>       at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:932)
>       at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:894)
>       at 
> org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.closeWriter(LuceneIndexEditorContext.java:192)
>       at 
> org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:202)
>       ... 10 common frames omitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to