Thomas Mueller created OAK-10341:
------------------------------------

             Summary: Indexing: replace FlatFileStore+PersistedLinkedList with a tree store
                 Key: OAK-10341
                 URL: https://issues.apache.org/jira/browse/OAK-10341
             Project: Jackrabbit Oak
          Issue Type: Improvement
            Reporter: Thomas Mueller

Currently, to index large repositories with the document store, we first read all nodes and write them to a sorted file (sorting and merging when needed). We then index from that sorted file, called the "FlatFileStore".

There are multiple problems with this mechanism:

* The last merging stage of the flat file store is not actually needed: we could index from the un-merged streams. That would save one step in which we write and read all the data.
* It requires knowing the aggregation in the index definition, in order to have a set of "preferred children". If this is unknown, indexing might take nearly infinite time.
* Even if it is known, indexing might be very slow, especially if some of the nodes that require aggregation have many direct child nodes.
* It requires a PersistedLinkedList to avoid running out of memory. This persisted linked list uses a key-value store internally, which adds overhead: we store and read the data again. However, access to that storage is still done using just an iterator, not a key lookup, so performance can still be quite bad.
* For parallel indexing, we split the flat file. This is not possible unless we know the aggregation. Sometimes splitting is not possible.

We want to explore using a tree store that would solve all of the above problems.
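
To illustrate why key lookup matters for aggregation, here is a minimal sketch. It is not the proposed implementation: it uses java.util.TreeMap as a stand-in for a tree store keyed by node path, and the class name TreeStoreSketch, the sample paths, and both helper methods are hypothetical and not part of Oak. With a sorted file that only offers an iterator, reaching one specific child means reading past everything stored before it. With a sorted key-value tree, the same child is a direct lookup and the subtree is a range scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap stands in for a tree store (path -> node data).
class TreeStoreSketch {

    // Builds a tiny sample store; paths and values are made up for illustration.
    static TreeMap<String, String> sample() {
        TreeMap<String, String> store = new TreeMap<>();
        store.put("/content", "{}");
        store.put("/content/a", "{\"type\":\"asset\"}");
        store.put("/content/a/child1", "{}");
        store.put("/content/a/child2", "{}");
        store.put("/content/a/jcr:content", "{\"title\":\"A\"}");
        store.put("/content/b", "{}");
        return store;
    }

    // Key lookup: fetch one aggregated child without iterating over its siblings.
    static String lookupChild(TreeMap<String, String> store, String parent, String childName) {
        return store.get(parent + "/" + childName);
    }

    // Range scan: all descendants of a node. '0' is the character directly
    // after '/', so [parent + "/", parent + "0") covers exactly the keys that
    // start with parent + "/".
    static List<String> descendants(TreeMap<String, String> store, String parent) {
        return new ArrayList<>(store.subMap(parent + "/", parent + "0").keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> store = sample();
        System.out.println(lookupChild(store, "/content/a", "jcr:content"));
        System.out.println(descendants(store, "/content/a"));
    }
}
```

Because any path can be looked up regardless of where the reader currently is, a store like this does not need a list of "preferred children" up front, and a key range can in principle be handed to a separate worker without first knowing the aggregation rules.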