[ https://issues.apache.org/jira/browse/OAK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra updated OAK-6513:
---------------------------------
    Description: 
The current async indexer design is based on NodeState diff. This has served us 
fine so far, but of late it does not perform well when the rate of 
repository writes is high. When changes happen faster than the index update can 
process them, the diffs grow larger and larger. Larger diffs make index updates 
slower, which in turn makes the next diff even larger than the one before 
(assuming a constant ingestion rate). 
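
The feedback loop described above can be illustrated with a toy model. This is only a sketch under assumed numbers: the class {{DiffBacklogModel}}, the linear cost model (a fixed cost per cycle plus a cost per changed node) and all parameter values are hypothetical and are not measurements from Oak.

```java
// Hypothetical toy model of the diff-size feedback loop; not Oak code.
class DiffBacklogModel {

    // Returns the diff size seen by each successive indexing cycle.
    // Assumption: a cycle takes fixedSeconds + secondsPerChange * diffSize,
    // and writes keep arriving at writesPerSecond while the cycle runs.
    static double[] simulate(double writesPerSecond, double fixedSeconds,
                             double secondsPerChange, int cycles) {
        double[] diffSizes = new double[cycles];
        double diff = 0; // the first cycle starts with an empty diff
        for (int i = 0; i < cycles; i++) {
            double cycleSeconds = fixedSeconds + secondsPerChange * diff;
            diff = writesPerSecond * cycleSeconds; // changes accumulated meanwhile
            diffSizes[i] = diff;
        }
        return diffSizes;
    }

    public static void main(String[] args) {
        // Indexer slower than the writers: every diff is larger than the last.
        for (double d : simulate(100, 5, 0.02, 5)) {
            System.out.println("overloaded diff size: " + d);
        }
        // Indexer faster than the writers: the diff size levels off.
        for (double d : simulate(100, 5, 0.005, 5)) {
            System.out.println("keeping up diff size: " + d);
        }
    }
}
```

In this model the diff size stays bounded only while writesPerSecond * secondsPerChange is below 1; at or above that point every diff is larger than the previous one.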

In the current diff-based flow the indexer performs a complete diff of all 
changes made between two cycles. It may happen that many writes occur but 
little indexable content is written, in which case the diff is wasted effort.

In the 1.6 release we implemented journal-based indexing of external changes 
for NRT indexing (OAK-4808, OAK-5430). That approach can be generalized and 
used for async indexing. 

Before describing the journal-based approach, let's see how IndexEditor works 
currently.

h4. IndexEditor 

Currently any IndexEditor performs two tasks:

# Identify which nodes are to be indexed based on an index definition. The 
editor is invoked as part of the content diff, where it determines which 
NodeState is to be indexed
# Update the index based on the node to be indexed

For example, in oak-lucene LuceneIndexEditor identifies the NodeStates to be 
indexed and LuceneDocumentMaker constructs the Lucene Document from the 
NodeState to be indexed. For the journal-based approach we can decouple these 
two parts and thus have:

* IndexEditor - Identifies which paths need to be indexed for a given index 
definition
* IndexUpdater - Updates the index based on a given NodeState and its path
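
The split described above could look roughly like the following. This is only an illustrative sketch: the interface names {{PathIndexEditor}} and {{IndexUpdater}}, their method signatures, the {{app:Asset}} node type and the use of a plain {{Map}} as a stand-in for a NodeState are assumptions made here and are not existing Oak APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a node is modelled as a map of property name to value
// instead of a real Oak NodeState.
interface PathIndexEditor {
    // Returns true if the node at 'path' must be indexed for this index definition.
    boolean isIndexable(String path, Map<String, String> node);
}

interface IndexUpdater {
    // Updates the index for a single node identified by its path.
    void update(String path, Map<String, String> node);
}

class EditorUpdaterSplitDemo {

    // Task 1 only: select the paths to index, without touching the index.
    static List<String> selectPaths(PathIndexEditor editor,
                                    Map<String, Map<String, String>> changedNodes) {
        List<String> paths = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : changedNodes.entrySet()) {
            if (editor.isIndexable(e.getKey(), e.getValue())) {
                paths.add(e.getKey());
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> changed = new LinkedHashMap<>();
        changed.put("/content/a", Map.of("jcr:primaryType", "app:Asset"));
        changed.put("/tmp/b", Map.of("jcr:primaryType", "nt:unstructured"));

        // Assumed index definition: index only nodes of type app:Asset.
        PathIndexEditor editor =
                (path, node) -> "app:Asset".equals(node.get("jcr:primaryType"));

        // Task 2 only: the updater writes to a map standing in for the index.
        Map<String, String> index = new LinkedHashMap<>();
        IndexUpdater updater =
                (path, node) -> index.put(path, node.get("jcr:primaryType"));

        for (String path : selectPaths(editor, changed)) {
            updater.update(path, changed.get(path));
        }
        System.out.println(index.keySet());
    }
}
```

Because the editor only returns paths, it can run at commit time while the updater runs later, which is what the flow below relies on.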

h4. High Level Flow

# Session Commit Flow
## Each index type would provide an IndexEditor which would be invoked as part 
of the commit (like sync indexes). These IndexEditors would only determine 
which paths need to be indexed. 
## As part of the commit the paths to be indexed would be written to the 
journal. 
# AsyncIndexUpdate flow
## AsyncIndexUpdate would query this journal to fetch all paths to be indexed 
between the two checkpoints
## For each such path it would invoke the {{IndexUpdater}} to update the index 
for that path
## Merge the index updates
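
A minimal in-memory sketch of this flow, assuming a journal that stores one entry per commit and checkpoints that are simply positions in that journal. The classes {{PathJournal}} and {{JournalAsyncIndexerDemo}} are hypothetical and only illustrate the steps above; they do not reflect the DocumentNodeStore journal or the real AsyncIndexUpdate.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory journal: one entry per commit, holding the paths
// that the commit-time editors marked as needing indexing.
class PathJournal {
    private final List<List<String>> entries = new ArrayList<>();

    // Called as part of a commit; returns the journal position after the commit.
    int append(List<String> pathsToIndex) {
        entries.add(new ArrayList<>(pathsToIndex));
        return entries.size();
    }

    // Distinct paths recorded by commits in positions [from, to).
    Set<String> pathsBetween(int from, int to) {
        Set<String> paths = new LinkedHashSet<>();
        for (int i = from; i < to; i++) {
            paths.addAll(entries.get(i));
        }
        return paths;
    }
}

class JournalAsyncIndexerDemo {
    private final PathJournal journal;
    private final Set<String> mergedIndex = new LinkedHashSet<>(); // stand-in for the index
    private int lastCheckpoint = 0;

    JournalAsyncIndexerDemo(PathJournal journal) {
        this.journal = journal;
    }

    // One async cycle: fetch paths between two checkpoints, update, merge.
    // Returns the number of distinct paths processed; no content diff is computed.
    int runCycle(int currentCheckpoint) {
        Set<String> paths = journal.pathsBetween(lastCheckpoint, currentCheckpoint);
        Set<String> pendingUpdates = new LinkedHashSet<>();
        for (String path : paths) {
            pendingUpdates.add(path); // stand-in for invoking an IndexUpdater
        }
        mergedIndex.addAll(pendingUpdates); // stand-in for merging the index updates
        lastCheckpoint = currentCheckpoint;
        return paths.size();
    }

    Set<String> indexedPaths() {
        return mergedIndex;
    }

    public static void main(String[] args) {
        PathJournal journal = new PathJournal();
        journal.append(List.of("/a", "/b"));
        int checkpoint = journal.append(List.of("/b", "/c"));

        JournalAsyncIndexerDemo indexer = new JournalAsyncIndexerDemo(journal);
        System.out.println("paths processed: " + indexer.runCycle(checkpoint));
        System.out.println("indexed: " + indexer.indexedPaths());
    }
}
```

In this sketch the cost of a cycle depends on the number of journaled paths rather than on the total number of writes between the two checkpoints.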

h4. Benefits

Such a design would have the following impact:

# More work is done as part of the write
# Marking of indexable content is distributed, hence less work needs to be 
done at indexing time
# Indexing can progress in batches 
# The indexers can be called in parallel
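
The last two points could be sketched as follows, assuming the journaled paths are independent of each other so that batches can be indexed on separate threads. The class {{BatchedParallelIndexingDemo}}, the batch size and the thread count are hypothetical, and a concurrent set stands in for the real index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of batched, parallel indexing of journaled paths.
class BatchedParallelIndexingDemo {

    // Splits the paths into consecutive batches of at most batchSize entries.
    static List<List<String>> toBatches(List<String> paths, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < paths.size(); i += batchSize) {
            int end = Math.min(i + batchSize, paths.size());
            batches.add(new ArrayList<>(paths.subList(i, end)));
        }
        return batches;
    }

    // Indexes each batch on a worker thread and returns the indexed paths.
    static Set<String> indexInParallel(List<String> paths, int batchSize, int threads) {
        Set<String> indexed = ConcurrentHashMap.newKeySet(); // stand-in for the index
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<String> batch : toBatches(paths, batchSize)) {
                futures.add(pool.submit(() -> { indexed.addAll(batch); }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for the batch and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("indexing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("indexing a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("/a", "/b", "/c", "/d", "/e");
        System.out.println("batches: " + toBatches(paths, 2));
        System.out.println("indexed count: " + indexInParallel(paths, 2, 2).size());
    }
}
```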

h4. Journal Implementation

DocumentNodeStore currently has a built-in journal which is used for NRT 
indexing. That feature can be exposed as an API. 

This design is mostly required for scaling indexing in the cluster case. So we 
could keep both indexing approaches and use the journal-based one for 
DocumentNodeStore setups, or we could look into implementing such a journal 
for SegmentNodeStore setups as well.

h4. Open Points

* Journal support in SegmentNodeStore
* Handling deletes. 



> Journal based Async Indexer
> ---------------------------
>
>                 Key: OAK-6513
>                 URL: https://issues.apache.org/jira/browse/OAK-6513
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.8
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
