[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627615#comment-14627615 ]
Chetan Mehrotra commented on OAK-2892:
--------------------------------------

[~reschke] For the test case failure I will follow up in OAK-3102. That test is not related to this feature.

> Speed up lucene indexing post migration by pre extracting the text content from binaries
> ----------------------------------------------------------------------------------------
>
>                 Key: OAK-2892
>                 URL: https://issues.apache.org/jira/browse/OAK-2892
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.2.3, 1.3.3, 1.0.18
>
>
> While migrating large repositories, say with 3 M docs (250k PDF), Lucene
> indexing takes a long time to complete (at times 4 days!). Currently the text
> extraction logic is coupled with Lucene indexing and hence is performed in
> single-threaded mode, which slows down the indexing process. Further, if
> reindexing has to be triggered, the extraction has to be done all over again.
> To speed up the Lucene indexing we can decouple the text extraction
> from the actual indexing. This is partly based on the discussion in OAK-2787.
> # Introduce a new ExtractedTextProvider which can provide extracted text for
> a given Blob instance
> # In oak-run introduce a new indexer mode - this would take a path in the
> repository, traverse the repository from there, look for existing
> binaries and extract text from them
> So before or after the migration is done one can run this oak-run tool to
> create a store which has the text already extracted. Then post startup we need
> to wire up the ExtractedTextProvider instance (which is backed by the BlobStore
> populated before) and the indexing logic can just get the content from that.
> This would avoid performing expensive text extraction in the indexing thread.
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
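
The two-phase idea in the issue description (extract text up front, then let indexing read it from a store) can be sketched roughly as below. This is only an illustration under stated assumptions: Binary, InMemoryTextStore, preExtract and textForIndexing are hypothetical names made up for this sketch, the ExtractedTextProvider interface shown here is a guess at the shape and not the Oak API, and an in-memory map stands in for the BlobStore-backed store.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class PreExtractionSketch {

    /** Hypothetical stand-in for a repository binary (not Oak's Blob). */
    static final class Binary {
        final String id;
        final byte[] content;

        Binary(String id, byte[] content) {
            this.id = id;
            this.content = content;
        }
    }

    /** Assumed shape of the provider: pre-extracted text for a binary, if any. */
    interface ExtractedTextProvider {
        Optional<String> getText(Binary binary);
    }

    /** In-memory store keyed by binary id; a real tool would persist this. */
    static final class InMemoryTextStore implements ExtractedTextProvider {
        private final Map<String, String> textById = new ConcurrentHashMap<>();

        void put(Binary binary, String text) {
            textById.put(binary.id, text);
        }

        @Override
        public Optional<String> getText(Binary binary) {
            return Optional.ofNullable(textById.get(binary.id));
        }
    }

    /** Offline phase: extract in parallel, outside the indexing thread. */
    static InMemoryTextStore preExtract(List<Binary> binaries,
                                        Function<Binary, String> extractor) {
        InMemoryTextStore store = new InMemoryTextStore();
        binaries.parallelStream().forEach(b -> store.put(b, extractor.apply(b)));
        return store;
    }

    /** Indexing phase: use stored text if present, else extract inline. */
    static String textForIndexing(Binary binary,
                                  ExtractedTextProvider provider,
                                  Function<Binary, String> extractor) {
        return provider.getText(binary).orElseGet(() -> extractor.apply(binary));
    }

    public static void main(String[] args) {
        // Toy extractor: treats the bytes as UTF-8 text. A real one would parse PDFs etc.
        Function<Binary, String> extractor =
                b -> new String(b.content, StandardCharsets.UTF_8);

        Binary doc = new Binary("doc-1", "hello oak".getBytes(StandardCharsets.UTF_8));
        InMemoryTextStore store = preExtract(Arrays.asList(doc), extractor);

        System.out.println(textForIndexing(doc, store, extractor));
    }
}
```

The fallback in textForIndexing is a design choice of this sketch: binaries added after the pre-extraction run still get indexed, at the cost of inline extraction for those binaries only.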