[ https://issues.apache.org/jira/browse/OAK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Davide Giannella updated OAK-6513:
----------------------------------
    Fix Version/s: 1.9.13

> Journal based Async Indexer
> ---------------------------
>
> Key: OAK-6513
> URL: https://issues.apache.org/jira/browse/OAK-6513
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: indexing
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Priority: Major
> Fix For: 1.10, 1.9.12, 1.9.13
>
>
> The current async indexer design is based on NodeState diff. This has served
> us fine so far; however, of late it has not been able to perform well when the
> rate of repository writes is high. When changes happen faster than
> index-update can process them, larger and larger diffs will happen. These make
> index-updates slower, which again leads to the next diff being ever larger
> than the one before (assuming a constant ingestion rate).
> In the current diff based flow the indexer performs a complete diff for all
> changes happening between two cycles. It may happen that lots of writes happen
> but not much indexable content is written, so doing the diff there is wasted
> effort.
> In the 1.6 release, for NRT Indexing we implemented journal based indexing for
> external changes (OAK-4808, OAK-5430). That approach can be generalized and
> used for async indexing.
> Before talking about the journal based approach, let's see how IndexEditor
> works currently.
> h4. IndexEditor
> Currently any IndexEditor performs 2 tasks
> # Identify which node is to be indexed based on some index definition. The
> Editor gets invoked as part of content diff where it determines which
> NodeState is to be indexed
> # Update the index based on the node to be indexed
> For example, in oak-lucene we have LuceneIndexEditor which identifies the
> NodeStates to be indexed and LuceneDocumentMaker which constructs the Lucene
> Document from the NodeState to be indexed.
> For the journal based approach we can decouple these 2 parts and thus have
> * IndexEditor - Identifies which paths need to be indexed for a given index
> definition
> * IndexUpdater - Updates the index based on a given NodeState and its path
> h4. High Level Flow
> # Session Commit Flow
> ## Each index type would provide an IndexEditor which would be invoked as part
> of commit (like sync indexes). These IndexEditors would just determine which
> paths need to be indexed.
> ## As part of commit the paths to be indexed would be written to the journal.
> # AsyncIndexUpdate flow
> ## AsyncIndexUpdate would query this journal to fetch all such indexed paths
> between the 2 checkpoints
> ## Based on the index path data it would invoke the {{IndexUpdater}} to
> update the index for that path
> ## Merge the index updates
> h4. Benefits
> Such a design would have the following impact
> # More work done as part of write
> # Marking of indexable content is distributed, hence less work needs to be
> done at indexing time
> # Indexing can progress in batches
> # The indexers can be called in parallel
> h4. Journal Implementation
> DocumentNodeStore currently has a built-in journal which is being used for
> NRT Indexing. That feature can be exposed as an API.
> For scaling indexing, this design is mostly required for the cluster case. So
> we can possibly have both indexing approaches implemented and use the journal
> based support for DocumentNodeStore setups. Or we can look into implementing
> such a journal for SegmentNodeStore setups also.
> h4. Open Points
> * Journal support in SegmentNodeStore
> * Handling deletes.
> Detailed proposal -
> https://wiki.apache.org/jackrabbit/Journal%20based%20Async%20Indexer



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
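
The High Level Flow in the quoted description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Oak code: every name in it (Journal, PrefixIndexEditor, DictIndexUpdater, commit, async_index_update) is hypothetical, the repository is a plain dict, revisions and checkpoints are plain integers, and deletes are not modelled because the proposal lists them as an open point.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    """Hypothetical append-only journal of (revision, index_name, path)."""
    entries: list = field(default_factory=list)

    def append(self, revision, index_name, path):
        self.entries.append((revision, index_name, path))

    def between(self, from_rev, to_rev):
        # Entries written after from_rev, up to and including to_rev.
        return [e for e in self.entries if from_rev < e[0] <= to_rev]


class PrefixIndexEditor:
    """Commit-time half: only decides which changed paths are indexable.

    The path-prefix rule stands in for a real index definition (assumption).
    """

    def __init__(self, index_name, path_prefix):
        self.index_name = index_name
        self.path_prefix = path_prefix

    def paths_to_index(self, changed_paths):
        return [p for p in changed_paths if p.startswith(self.path_prefix)]


class DictIndexUpdater:
    """Async half: updates the index for one path from its node state."""

    def __init__(self):
        self.index = {}

    def update(self, path, node_state):
        self.index[path] = node_state


def commit(repo, journal, editors, revision, changes):
    """Session commit flow: apply changes, journal the indexable paths."""
    repo.update(changes)
    for editor in editors:
        for path in editor.paths_to_index(changes.keys()):
            journal.append(revision, editor.index_name, path)


def async_index_update(repo, journal, updaters, last_checkpoint, new_checkpoint):
    """Async flow: read journalled paths between two checkpoints and index them."""
    seen = set()
    for _rev, index_name, path in journal.between(last_checkpoint, new_checkpoint):
        if (index_name, path) in seen:
            continue  # each path is indexed once per cycle, from its latest state
        seen.add((index_name, path))
        updaters[index_name].update(path, repo.get(path))
    return new_checkpoint


# Two commits; only paths under /content/ are journalled for the "assets" index.
repo, journal = {}, Journal()
editor = PrefixIndexEditor("assets", "/content/")
updaters = {"assets": DictIndexUpdater()}
commit(repo, journal, [editor], 1, {"/content/a": {"title": "A"}, "/tmp/x": {"n": 1}})
commit(repo, journal, [editor], 2, {"/content/a": {"title": "A2"}, "/tmp/y": {"n": 2}})
checkpoint = async_index_update(repo, journal, updaters, 0, 2)
```

In this model the async cycle never looks at the /tmp writes at all, which is the saving the description is after: the cost of a cycle follows the number of journalled indexable paths rather than the size of the full diff between checkpoints.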