[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467219#comment-15467219 ]
Chetan Mehrotra edited comment on OAK-4412 at 9/6/16 11:58 AM:
---------------------------------------------------------------

Planned feature work is now done and [patch|^OAK-4412-v1.diff] is ready for review.

h3. A - Purpose

Hybrid index provides 2 indexing modes

h4. nrt

In this mode, for each commit Lucene Documents would be created as part of the sync commit and would be added asynchronously to a *local* index, where the IndexReader would be refreshed with a _refresh interval_ of 1 sec.

h4. sync

In this mode the Lucene document would be added to the index and the IndexReader would be refreshed *immediately*. Functionally this would be similar to a property index. This mode should be used for those cases where code expects changes made in a session to be reflected immediately in queries. So if a session sets _/a/b/@foo_ to _bar_, performs a query for 'bar' just after session save and expects /a/b/@foo to be part of the result set, then this mode should be used. Performance wise this mode is slower than {{nrt}} and slows down writes.

The indexes created under hybrid index are local and hold index data from the last async index cycle up to the most recent commit. Any search would be performed via a MultiReader with one reader from the local index and another from the index built as part of async indexing.

h3. B - Usage

To enable this mode for any index you need to make the {{async}} property a multi value property with the following values

* {{async}} = [{{async}}, {{nrt}}] - Enables the NRT mode
* {{async}} = [{{async}}, {{sync}}] - Enables the sync mode

{{LuceneIndexProviderService}} - Provides some tuning configuration which can be modified as per setup requirements

h4. Implementation Detail

Most of the new code lives under the {{org.apache.jackrabbit.oak.plugins.index.lucene.hybrid}} package.
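
As a rough illustration of the Usage rule above, the mapping from the values of the {{async}} property to an indexing mode can be sketched in plain Java. This is only a sketch and not the code from the patch: the class {{IndexingModeSketch}}, the enum {{Mode}} and the method {{modeOf}} are hypothetical names, and letting {{sync}} win when both markers are present is an assumption, since the comment does not cover that case.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: models the Usage rule, not the actual Oak implementation.
class IndexingModeSketch {

    // NONE means the index definition carries no hybrid marker.
    enum Mode { NONE, NRT, SYNC }

    /** Derives the hybrid mode from the values of the multi value async property. */
    static Mode modeOf(List<String> asyncValues) {
        // Assumption: if both markers are present, sync takes precedence.
        if (asyncValues.contains("sync")) {
            return Mode.SYNC;
        }
        if (asyncValues.contains("nrt")) {
            return Mode.NRT;
        }
        return Mode.NONE;
    }

    public static void main(String[] args) {
        System.out.println(modeOf(Arrays.asList("async", "nrt")));
        System.out.println(modeOf(Arrays.asList("async", "sync")));
        System.out.println(modeOf(Arrays.asList("async")));
    }
}
```

An index definition that only has {{async}} = [{{async}}] stays a plain async index in this sketch.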
For any commit involving an index definition marked with {{nrt}} or {{sync}}, {{LuceneIndexEditorProvider}} would return a {{LuceneIndexEditor}} backed by {{LocalIndexWriterFactory}}. This factory would use {{LocalIndexWriter}} and store the prepared {{LuceneDoc}} in {{LuceneDocumentHolder}}. This holder instance is stored as part of {{CommitContext}} (which is stored in the {{CommitInfo}} associated with the commit).

Once the merge is done for that commit the change is picked up by {{LocalIndexObserver}} (a sync observer). This observer would then look for {{LuceneDocumentHolder}} and, if found, would process the {{LuceneDoc}} stored in it

* For documents belonging to {{nrt}} mode it would add the docs to {{DocumentQueue}}
* For documents belonging to {{sync}} mode it would directly write the document to the {{NRTIndex}} configured for that index

{{DocumentQueue}} asynchronously picks up the docs from the queue and then writes them to the index.

*NRTIndex*

On the indexing side each index (represented by {{IndexNode}}) has a matching {{NRTIndex}} which is constructed from {{NRTIndexFactory}}. Whenever a new {{IndexNode}} instance is created as a result of a change in the async index (via {{IndexTracker}}) the factory would create a new {{NRTIndex}} for it. It keeps at most 2 instances of {{NRTIndex}} and closes and garbage collects older ones. So an {{NRTIndex}} would only have index data for the data indexed between 2 consecutive async indexing cycles.

{{NRTIndex}} provides access to the {{IndexWriter}} which is used by {{DocumentQueue}} to write documents to it. It also creates the {{IndexReader}}, which is obtained from the {{IndexWriter}} making use of [Lucene NRT Support|http://wiki.apache.org/lucene-java/NearRealtimeSearch]

{{NRTIndex}} also provides access to {{ReaderRefreshPolicy}} which determines how and when the reader should be refreshed. The policy instance is also made aware of the changes done to the index.
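
A minimal sketch of what such a policy could look like, with a timed variant and a refresh-on-write variant. The names {{Policy}}, {{Timed}} and {{OnWrite}} and the two-method contract are made up for illustration and are not the actual Oak classes; the sketch also assumes that the timed variant only refreshes when something was written, and it takes the current time as a parameter so the behaviour is easy to follow.

```java
// Illustrative only: hypothetical types, not the ReaderRefreshPolicy from the patch.
class RefreshPolicySketch {

    /** Hypothetical contract: the policy is told about writes and asked before reads. */
    interface Policy {
        void onWrite();                          // the index was changed
        boolean shouldRefresh(long nowMillis);   // true if the reader should be reopened now
    }

    /** Refreshes at most once per interval, and only if something was written. */
    static final class Timed implements Policy {
        private final long intervalMillis;
        private long lastRefreshMillis;
        private boolean dirty;

        Timed(long intervalMillis, long startMillis) {
            this.intervalMillis = intervalMillis;
            this.lastRefreshMillis = startMillis;
        }

        @Override
        public synchronized void onWrite() {
            dirty = true;
        }

        @Override
        public synchronized boolean shouldRefresh(long nowMillis) {
            if (dirty && nowMillis - lastRefreshMillis >= intervalMillis) {
                dirty = false;
                lastRefreshMillis = nowMillis;
                return true;
            }
            return false;
        }
    }

    /** Refreshes on the first read after any write. */
    static final class OnWrite implements Policy {
        private boolean dirty;

        @Override
        public synchronized void onWrite() {
            dirty = true;
        }

        @Override
        public synchronized boolean shouldRefresh(long nowMillis) {
            boolean refresh = dirty;
            dirty = false;
            return refresh;
        }
    }

    public static void main(String[] args) {
        Policy timed = new Timed(1000, 0);
        timed.onWrite();
        System.out.println(timed.shouldRefresh(500));   // interval not yet elapsed
        System.out.println(timed.shouldRefresh(1000));  // interval elapsed and index is dirty
    }
}
```

The timed variant trades freshness for fewer reader reopens, while the on-write variant reopens the reader after every change and therefore costs more per write.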
For {{nrt}} indexes {{TimedRefreshPolicy}} is used, which by default refreshes the reader after a 1 sec delay. For {{sync}} indexes {{RefreshOnWritePolicy}} is used, which refreshes the reader after any write.

*Avoiding Deletes*

The indexing logic avoids deleting any document in the Lucene index. So if /a/b/@foo is updated, say, 3 times between 2 async index cycles

* /a/b/@foo = 'x'
* /a/b/@foo = 'y'
* /a/b/@foo = 'z'

then the Lucene index would have 3 documents added (none updated). {{LucenePropertyIndex}} would then match any of the 3 depending on the query criteria. Say the query is for foo='x': {{LucenePropertyIndex}} would return /a/b as part of the Cursor. The cursor used is a unique cursor, so if Lucene returns three documents for /a/b then only the first one would result in an entry in the cursor and the others would be ignored.

Later the query engine (QE) would evaluate /a/b against the query criteria as per the {{ContentSession}} revision, and if the node value at that time matches then the result would be returned to the end user, otherwise it would be skipped. So if per the current root NodeState /a/b/@foo='x' and for a query on foo='y' {{LucenePropertyIndex}} returns /a/b, then QE would filter out that result.

So in no case would the correctness of the result be affected. This allows us to avoid deleting documents in the Lucene index.

h3. C - Benchmark

A benchmark has been implemented in oak-run under {{HybridIndexTest}}. It creates multiple indexes (_numOfIndexes_ = 10) to simulate a system having multiple indexes defined and then creates nodes with property {{foo}} set to a value as per enum _Status_. Each thread then creates nodes in breadth first fashion (defaults to 5 child nodes per node and then for each child node).
In addition there is a {{Searcher}} thread which queries for different values and a {{Mutator}} which modifies the values

* refreshDeltaMillis - 1000 - Time delay between reader reopen for nrt
* asyncInterval - 5 - Time in seconds for async indexer
* queueSize - 1000 - Size of queue used by {{DocumentQueue}}
* hybridIndexEnabled - Boolean flag. If set to true hybrid index would be used otherwise property index would be used
* indexingMode - Defaults to nrt - [nrt/sync] - Which mode to use if hybridIndexEnabled
* useOakCodec - Boolean flag. If set to true {{oakCodec}} would be used to avoid compression which slows down the searches (OAK-1737)

{noformat}
java -DhybridIndexEnabled=true -DindexingMode=nrt -jar oak-run*.jar benchmark --concurrency=5 HybridIndexTest Oak-Mongo-FDS Oak-Segment-Tar-FDS
{noformat}

_Results would be posted soon_

h3. D - Pending Feature Work

* Support for listening to external changes and then updating the {{nrt}} indexes based on those changes
* JMX MBean around NRTIndexFactory to see rate of change etc

> Lucene hybrid index
> -------------------
>
> Key: OAK-4412
> URL: https://issues.apache.org/jira/browse/OAK-4412
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: lucene
> Reporter: Tomek Rękawek
> Assignee: Chetan Mehrotra
> Fix For: 1.6
>
> Attachments: OAK-4412-v1.diff, OAK-4412.patch
>
>
> When running Oak in a cluster, each write operation is expensive. After
> performing some stress-tests with a geo-distributed Mongo cluster, we've
> found out that updating property indexes is a large part of the overall
> traffic.
> The asynchronous index would be an answer here (as the index update won't be
> made in the client request thread), but the AEM requires the updates to be
> visible immediately in order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a
> synchronous, locally-stored counterpart that will persist only the data since
> the last Lucene background reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local
> files. Once the "main" Lucene index is being updated, the local index will be
> purged.
> Queries will use an union of results from the {{lucene}} and
> {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a local stored entity, will be updated using
> an observer, so it'll get both local and remote changes.
> The original idea has been suggested by [~chetanm] in the discussion for the
> OAK-4233.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)