Thomas Mueller created OAK-10341:
------------------------------------

             Summary: Indexing: replace FlatFileStore+PersistedLinkedList with a tree store
                 Key: OAK-10341
                 URL: https://issues.apache.org/jira/browse/OAK-10341
             Project: Jackrabbit Oak
          Issue Type: Improvement
            Reporter: Thomas Mueller


Currently, for indexing large repositories with the document store, we first 
read all nodes and write them to a sorted file (sorting and merging when 
needed). Then we index from that sorted file (called "FlatFileStore").

There are multiple problems with this mechanism:
* The last merging stage of the flat file store is actually not needed: we 
could index from the un-merged streams. It would save one step where we write 
and read all the data.
* It requires knowing the aggregation in the index definition, in order to have 
a set of "preferred children". If this is unknown, then indexing might take 
an impractically long time. 
* Even if it is known, indexing might be very slow, especially if there are 
many direct child nodes for some of the nodes that require aggregation. 
* It requires a PersistedLinkedList to avoid running out of memory. This 
persisted linked list uses a key-value store internally. This is an additional 
overhead: we store and read the data again. However, access to that storage is 
still done using just an iterator, and not with a key lookup. So performance 
can still be quite bad.
* For parallel indexing, we split the flat file. This is only possible if we 
know the aggregation. Sometimes splitting is not possible.
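
To illustrate the first point, here is a minimal sketch of reading several 
sorted, un-merged streams in overall sorted order without first writing a 
merged file. The class name LazyMerge and the use of plain String ordering are 
assumptions made for this example; it is not the Oak implementation, and the 
real flat file store orders entries by path, which may differ from 
String.compareTo.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Oak: merges sorted runs on the fly.
public class LazyMerge {

    /** One sorted run together with its current (smallest unread) entry. */
    private static final class Run implements Comparable<Run> {
        final Iterator<String> source;
        String head;

        Run(Iterator<String> source) {
            this.source = source;
            this.head = source.next();
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    /** Returns an iterator over all entries of the sorted runs, in sorted order. */
    public static Iterator<String> merge(List<Iterator<String>> sortedRuns) {
        PriorityQueue<Run> queue = new PriorityQueue<>();
        for (Iterator<String> run : sortedRuns) {
            if (run.hasNext()) {
                queue.add(new Run(run));
            }
        }
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return !queue.isEmpty();
            }

            @Override
            public String next() {
                // Take the run whose head is smallest, emit that head,
                // then re-queue the run if it has more entries.
                Run smallest = queue.poll();
                if (smallest == null) {
                    throw new NoSuchElementException();
                }
                String result = smallest.head;
                if (smallest.source.hasNext()) {
                    smallest.head = smallest.source.next();
                    queue.add(smallest);
                }
                return result;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> merged = merge(List.of(
                List.of("/a", "/c").iterator(),
                List.of("/a/b", "/d").iterator()));
        while (merged.hasNext()) {
            System.out.println(merged.next()); // /a, /a/b, /c, /d
        }
    }
}
```

A consumer such as an indexer could pull entries from this iterator directly, 
so the merged data is never written and read back.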

We want to explore using a tree store that would solve all of the above 
problems.
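
As a rough illustration of why a tree store could help, the sketch below keeps 
nodes in a sorted map keyed by path, so that a specific child can be fetched 
with a key lookup instead of iterating over all siblings. The class 
TreeStoreSketch, its methods, and the in-memory java.util.TreeMap are 
assumptions for this example only; the actual tree store design is not 
described in this issue and would be persistent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a tree store; not the Oak implementation.
public class TreeStoreSketch {

    // Sorted map from node path to (serialized) node data.
    private final TreeMap<String, String> nodes = new TreeMap<>();

    public void put(String path, String data) {
        nodes.put(path, data);
    }

    /** Key lookup of one child: no need to scan the other children. */
    public String getChild(String parentPath, String childName) {
        return nodes.get(parentPath + "/" + childName);
    }

    /**
     * All descendants of a non-root path, via a range query.
     * '0' is the character directly after '/', so the range
     * [path + "/", path + "0") covers exactly the keys starting with path + "/".
     */
    public List<String> descendants(String path) {
        return new ArrayList<>(nodes.subMap(path + "/", path + "0").keySet());
    }

    public static void main(String[] args) {
        TreeStoreSketch store = new TreeStoreSketch();
        store.put("/content", "{}");
        store.put("/content/a", "{}");
        store.put("/content/a/jcr:content", "{\"title\":\"A\"}");
        store.put("/content/b", "{}");
        store.put("/contentx", "{}");

        System.out.println(store.getChild("/content/a", "jcr:content"));
        System.out.println(store.descendants("/content"));
    }
}
```

With this kind of access, aggregation would not depend on a predefined set of 
"preferred children", and a key range is a natural unit for splitting work 
across parallel indexers.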



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
