[ 
https://issues.apache.org/jira/browse/OAK-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502794#comment-14502794
 ] 

Marcel Reutegger commented on OAK-2778:
---------------------------------------

The root cause is most likely a race condition in the VersionGarbageCollector. 
It first gets a list of candidate documents to remove. The condition for those 
documents is: the _deletedOnce flag is true and the _modified timestamp is 
older than one day. Then the VersionGarbageCollector goes through each of the 
documents and checks whether the node based on that document exists at the 
head revision as of the time the GC started. If the node doesn't exist, the 
document is considered garbage and removed.
The race condition occurs when the node is re-created after the GC has started 
and while it is still going through the candidate documents. The 
VersionGarbageCollector will consider the document garbage even though it was 
modified after the given timestamp.
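
The sequence above can be modelled with a small, self-contained sketch. This 
is not Oak code: the Doc class, the in-memory map, and the Runnable that stands 
in for a concurrent writer are all hypothetical simplifications, used only to 
show how a document re-created during the GC run can still be removed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class VersionGcRaceSketch {

    /** Hypothetical stand-in for a stored document; not Oak's document type. */
    static final class Doc {
        final boolean deletedOnce;   // models the _deletedOnce flag
        final long modified;         // models the _modified timestamp (ms)
        final boolean existsAtHead;  // whether the node exists at the head revision

        Doc(boolean deletedOnce, long modified, boolean existsAtHead) {
            this.deletedOnce = deletedOnce;
            this.modified = modified;
            this.existsAtHead = existsAtHead;
        }
    }

    /** Step 1: candidates are documents deleted once and not modified since the cutoff. */
    static List<String> candidates(Map<String, Doc> store, long cutoff) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Doc> e : store.entrySet()) {
            if (e.getValue().deletedOnce && e.getValue().modified < cutoff) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Runs the simplified GC; the writer runs between candidate selection and removal. */
    static void runGc(Map<String, Doc> store, long cutoff, Runnable concurrentWriter) {
        Map<String, Doc> headAtGcStart = new HashMap<>(store); // head revision when GC started
        List<String> candidates = candidates(store, cutoff);

        concurrentWriter.run(); // a commit lands while the GC is still working

        // Step 2: the check uses the state at GC start, so later changes are not seen.
        for (String path : candidates) {
            if (!headAtGcStart.get(path).existsAtHead) {
                store.remove(path);
            }
        }
    }

    /** Returns true if a document re-created during the GC run was removed anyway. */
    static boolean recreatedDocumentIsLost() {
        Map<String, Doc> store = new HashMap<>();
        String path = "/oak:index/lucene/:data/_1.cfe"; // illustrative path
        long oneDay = 86_400_000L;
        long now = 200_000_000L;

        // The node was deleted two days ago, so its document is a GC candidate.
        store.put(path, new Doc(true, now - 2 * oneDay, false));

        // A re-index re-creates the node under the same name during the GC run.
        runGc(store, now - oneDay, () -> store.put(path, new Doc(true, now, true)));

        return !store.containsKey(path);
    }

    public static void main(String[] args) {
        System.out.println("re-created document lost: " + recreatedDocumentIsLost());
    }
}
```

In this model, re-checking the current document (for example its modified 
timestamp) immediately before removal would avoid the loss. Whether that is the 
right fix for the real VersionGarbageCollector is a separate question.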
For the lucene index update, the race condition is usually quite unlikely. 
Almost all lucene files are written once and never modified. The exception is 
a re-index: Oak deletes all existing lucene files and starts a new index from 
scratch (atomically). This means the new index files will use names of files 
that existed before. Even then, a re-index is not a problem in most cases 
because of the way lucene assigns names to files. Almost all lucene files have 
a generation suffix, a base-36 number that starts at 0 and increases 
monotonically with each new generation. This means the garbage collector will 
always remove the oldest generations and not touch the active part of the 
index. Even if a re-index occurs, the previously active lucene index will be 
at a rather high generation, which makes it unlikely to overlap with the 
generation of the new index.

The probability increases considerably if multiple re-indexes occur, e.g. once 
a day. This means the lucene generations eligible for garbage collection will 
overlap with the new generations to be used by the active index.
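
To illustrate the naming argument, here is a minimal sketch. It assumes, to my 
understanding of lucene's file naming, that the counter is rendered with 
Character.MAX_RADIX (base 36), which matches names such as _1co.cfe in the 
stack trace. The counter ranges are made-up numbers chosen only to contrast a 
long-lived index with one that is re-indexed often.

```java
import java.util.Set;
import java.util.TreeSet;

class GenerationOverlapSketch {

    /** Renders a counter the way lucene-style names appear, e.g. 1752 -> "_1co". */
    static String segmentName(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    /** Names used for counters in the range [from, toExclusive). */
    static Set<String> names(int from, int toExclusive) {
        Set<String> result = new TreeSet<>();
        for (int i = from; i < toExclusive; i++) {
            result.add(segmentName(i));
        }
        return result;
    }

    /** Names that are both eligible for GC and in use by the active index. */
    static Set<String> overlap(Set<String> gcEligible, Set<String> active) {
        Set<String> result = new TreeSet<>(gcEligible);
        result.retainAll(active);
        return result;
    }

    public static void main(String[] args) {
        // Long-lived index: the old files sit at high counters, the new index restarts at 0.
        Set<String> rare = overlap(names(1000, 1100), names(0, 50));

        // Frequent re-index: the old index never got far, so the name ranges coincide.
        Set<String> frequent = overlap(names(0, 60), names(0, 50));

        System.out.println("overlap after a rare re-index: " + rare.size());
        System.out.println("overlap with frequent re-indexes: " + frequent.size());
    }
}
```

With these ranges the first case shares no names while the second shares all 
50 names of the new index, and those shared names are the ones exposed to the 
race in the VersionGarbageCollector.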

I'll try to create a test to reproduce the issue.

> DocumentNodeState is null for revision rx-x-x
> ---------------------------------------------
>
>                 Key: OAK-2778
>                 URL: https://issues.apache.org/jira/browse/OAK-2778
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: core, mongomk
>    Affects Versions: 1.0, 1.2
>            Reporter: Marcel Reutegger
>            Assignee: Marcel Reutegger
>             Fix For: 1.3.0
>
>
> On a system running Oak 1.0.12 the following exception is seen repeatedly 
> when the async index update tries to update a lucene index:
> {noformat}
> org.apache.sling.commons.scheduler.impl.QuartzScheduler Exception during job execution of org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate@6be42cde : DocumentNodeState is null for revision r14cbbd50ad2-0-1 of /oak:index/lucene/:data/_1co.cfe (aborting getChildNodes())
> org.apache.jackrabbit.oak.plugins.document.DocumentStoreException: DocumentNodeState is null for revision r14cbbd50ad2-0-1 of /oak:index/lucene/:data/_1co.cfe (aborting getChildNodes())
> at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore$6.apply(DocumentNodeStore.java:925)
> at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore$6.apply(DocumentNodeStore.java:919)
> at com.google.common.collect.Iterators$8.transform(Iterators.java:794)
> at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
> at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
> at org.apache.jackrabbit.oak.plugins.document.DocumentNodeState$ChildNodeEntryIterator.next(DocumentNodeState.java:618)
> at org.apache.jackrabbit.oak.plugins.document.DocumentNodeState$ChildNodeEntryIterator.next(DocumentNodeState.java:587)
> at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
> at com.google.common.collect.Iterators.addAll(Iterators.java:357)
> at com.google.common.collect.Lists.newArrayList(Lists.java:146)
> at com.google.common.collect.Iterables.toCollection(Iterables.java:334)
> at com.google.common.collect.Iterables.toArray(Iterables.java:312)
> at org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory.listAll(OakDirectory.java:69)
> at org.apache.lucene.index.DirectoryReader.indexExists(DirectoryReader.java:339)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:720)
> at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.getWriter(LuceneIndexEditorContext.java:134)
> at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.addOrUpdate(LuceneIndexEditor.java:260)
> at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:171)
> at org.apache.jackrabbit.oak.spi.commit.CompositeEditor.leave(CompositeEditor.java:74)
> at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.leave(VisibleEditor.java:63)
> at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeAdded(EditorDiff.java:130)
> at org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.compareAgainstEmptyState(EmptyNodeState.java:160)
> {noformat}
> A similar issue was already fixed with OAK-2420.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
