[ https://issues.apache.org/jira/browse/OAK-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Davide Giannella updated OAK-2787:
----------------------------------
    Fix Version/s: 1.14.0

> Faster multi threaded indexing / text extraction for binary content
> -------------------------------------------------------------------
>
>                 Key: OAK-2787
>                 URL: https://issues.apache.org/jira/browse/OAK-2787
>             Project: Jackrabbit Oak
>          Issue Type: Wish
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Priority: Major
>             Fix For: 1.12.0, 1.14.0
>
>
> With Lucene based indexing the indexing process is single threaded. This
> hampers the indexing of binary content, because on a multi-processor system
> only a single thread can be used to perform the indexing.
> [~ianeboston] suggested a possible approach [1] involving two-phase indexing:
> # In the first phase, detect the nodes to be indexed and start the full text
> extraction of the binary content. After extraction, save the binary token
> stream back to the node as hidden data. In this phase the node properties
> can still be indexed, and a marker field would be added to indicate that the
> fulltext index is still pending.
> # Later, in the second phase, look for all such Lucene docs and update them
> with the saved token stream.
> This would allow the text extraction logic to be decoupled from the Lucene
> indexing logic.
> [1] http://markmail.org/thread/2w5o4bwqsosb6esu
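
The two-phase approach described in the issue can be illustrated with a small, library-free sketch. It is not Oak or Lucene code: the in-memory `nodes` and `index` dictionaries, the property name `:extractedTokens`, the marker field `fulltextPending`, and the functions `extract_tokens`, `phase_one` and `phase_two` are all hypothetical names chosen for illustration. A whitespace tokenizer stands in for real text extraction. The point is only to show how extraction can run on several threads while property indexing proceeds, with a later pass filling in the full text.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names throughout; none of these are Oak or Lucene APIs.
HIDDEN_TOKENS = ":extractedTokens"  # hidden property holding the saved token stream
PENDING = "fulltextPending"         # marker field on the index document


def extract_tokens(binary):
    """Stand-in for real text extraction: decode, lowercase, split on whitespace."""
    return binary.decode("utf-8", errors="ignore").lower().split()


def phase_one(nodes, index, workers=4):
    """Index node properties immediately and extract binaries in parallel.

    Each node with a binary gets a marker on its index document, and the
    extracted token stream is saved back onto the node as hidden data.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {}
        for path, node in nodes.items():
            doc = {"props": dict(node["props"])}
            if node.get("binary") is not None:
                doc[PENDING] = True  # full text still to come
                futures[path] = pool.submit(extract_tokens, node["binary"])
            index[path] = doc
        # Collect results and store them on the nodes, not in the index.
        for path, future in futures.items():
            nodes[path][HIDDEN_TOKENS] = future.result()


def phase_two(nodes, index):
    """Update every marked document with its saved token stream.

    Returns the number of documents that were updated.
    """
    updated = 0
    for path, doc in index.items():
        if doc.get(PENDING) and HIDDEN_TOKENS in nodes[path]:
            doc["fulltext"] = nodes[path][HIDDEN_TOKENS]
            del doc[PENDING]
            updated += 1
    return updated


if __name__ == "__main__":
    nodes = {
        "/content/a.txt": {"props": {"title": "A"}, "binary": b"Hello Oak"},
        "/content/b": {"props": {"title": "B"}, "binary": None},
    }
    index = {}
    phase_one(nodes, index)
    print(phase_two(nodes, index), index["/content/a.txt"]["fulltext"])
```

Because the second phase reads only the saved token stream, it never calls the extractor, which is the decoupling the issue asks for. A real implementation would also have to decide how the hidden data is persisted and when it is cleaned up, which this sketch leaves open.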