[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Dürig updated OAK-2892:
-------------------------------
    Fix Version/s: 1.0.16

> Speed up lucene indexing post migration by pre extracting the text content from binaries
> ----------------------------------------------------------------------------------------
>
>                 Key: OAK-2892
>                 URL: https://issues.apache.org/jira/browse/OAK-2892
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.3.1, 1.0.16
>
>
> While migrating large repositories, say with 3 M docs (250k PDFs), Lucene
> indexing takes a long time to complete (at times 4 days!). Currently the text
> extraction logic is coupled with Lucene indexing and hence is performed in
> single-threaded mode, which slows down the indexing process. Further, if
> reindexing has to be triggered, the extraction has to be done all over again.
>
> To speed up Lucene indexing we can decouple the text extraction from the
> actual indexing. This is partly based on the discussion in OAK-2787.
>
> # Introduce a new ExtractedTextProvider which can provide extracted text for
>   a given Blob instance
> # In oak-run introduce a new indexer mode. This would take a path in the
>   repository, traverse the repository from there, look for existing
>   binaries and extract text from them
>
> So before or after the migration is done, one can run this oak-run tool to
> create a store which has the text already extracted. Then post startup we
> need to wire up the ExtractedTextProvider instance (which is backed by the
> BlobStore populated before) and the indexing logic can just get content from
> that. This would avoid performing expensive text extraction in the indexing
> thread.
>
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
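
The decoupling proposed in the issue can be illustrated with a rough, self-contained Java sketch. Only the name ExtractedTextProvider comes from the issue text. The method signature, the BlobRef stand-in for a binary, the in-memory store, and the fallback behaviour are all assumptions made for illustration and are not Oak's actual API.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class PreExtractionSketch {

    /** Hypothetical stand-in for a binary; this is NOT Oak's Blob type. */
    static final class BlobRef {
        final String contentId;
        final byte[] data;

        BlobRef(String contentId, byte[] data) {
            this.contentId = contentId;
            this.data = data;
        }
    }

    /** Name taken from the issue; the method signature is an assumption. */
    interface ExtractedTextProvider {
        Optional<String> getText(BlobRef blob);
    }

    /** In-memory map standing in for the store populated before indexing. */
    static final class InMemoryTextStore implements ExtractedTextProvider {
        private final Map<String, String> textById = new ConcurrentHashMap<>();

        void put(String contentId, String text) {
            textById.put(contentId, text);
        }

        @Override
        public Optional<String> getText(BlobRef blob) {
            return Optional.ofNullable(textById.get(blob.contentId));
        }
    }

    /** Offline step: extract text for every binary ahead of indexing. */
    static void preExtract(Iterable<BlobRef> blobs,
                           Function<BlobRef, String> extractor,
                           InMemoryTextStore store) {
        for (BlobRef blob : blobs) {
            store.put(blob.contentId, extractor.apply(blob));
        }
    }

    /** Indexing step: prefer pre-extracted text, otherwise extract on the spot. */
    static String textForIndexing(BlobRef blob,
                                  ExtractedTextProvider provider,
                                  Function<BlobRef, String> extractor) {
        return provider.getText(blob).orElseGet(() -> extractor.apply(blob));
    }

    public static void main(String[] args) {
        // Trivial "extractor" that decodes bytes; a real one would parse PDFs etc.
        Function<BlobRef, String> extractor =
                blob -> new String(blob.data, StandardCharsets.UTF_8);

        InMemoryTextStore store = new InMemoryTextStore();
        BlobRef pdf = new BlobRef("id-1", "hello".getBytes(StandardCharsets.UTF_8));

        preExtract(List.of(pdf), extractor, store);
        System.out.println(textForIndexing(pdf, store, extractor));
    }
}
```

In this sketch the indexing path only performs a lookup when the text was extracted beforehand, so the expensive extraction can run separately (and be reused across reindexing runs) rather than inside the indexing thread.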