[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585526#comment-14585526 ]
Chetan Mehrotra commented on OAK-2892:
--------------------------------------

There are 3 parts to this work

# Scan binaries and pre extract text - this part is dealt with in the OAK-2953 sub-task
# Persisting extracted text - there can be multiple ways to store the extracted text. For the migration case we can go for a simple solution where the extracted text is saved as a file on the FS, similar to FileDataStore.
# Read from pre extracted text

{code:java}
package org.apache.jackrabbit.oak.plugins.index.fulltext;

public interface PreExtractedTextProvider {

    /**
     * Get pre extracted text for given blob
     *
     * @param propertyPath path of the binary property
     * @param blob binary property value
     *
     * @return pre extracted text or null if no pre extracted
     *         text found
     */
    @CheckForNull
    String getText(String propertyPath, Blob blob);
}
{code}

Of these, #1 can be made part of oak-run and would rely on #2, while #3 would be used by oak-lucene and oak-solr and would also rely on #2. So the proposal is

# Add the interface from #3 above to {{org.apache.jackrabbit.oak.plugins.index.fulltext}}
# Provide an implementation of the above (and also a writer to be used by oak-run) under {{org.apache.jackrabbit.oak.plugins.blob.datastore}}. The implementation can be enabled via OSGi config (a rough sketch of what such an implementation might look like is appended after the quoted issue description below)
# Have LuceneIndexEditor and SolrEditor use the interface from #1 to check if text has already been extracted

[~tmueller] [~alex.parvulescu] [~teofili] Thoughts on the above?

> Speed up lucene indexing post migration by pre extracting the text content
> from binaries
> ----------------------------------------------------------------------------------------
>
>                 Key: OAK-2892
>                 URL: https://issues.apache.org/jira/browse/OAK-2892
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.3.1, 1.0.16
>
>
> While migrating large repositories, say one having 3M docs (250k PDFs), Lucene
> indexing takes a long time to complete (at times 4 days!). Currently the text
> extraction logic is coupled with Lucene indexing and hence is performed in
> single-threaded mode, which slows down the indexing process. Further, if
> reindexing has to be triggered, the extraction has to be done all over again.
> To speed up the Lucene indexing we can decouple the text extraction
> from the actual indexing. This is partly based on the discussion in OAK-2787
> # Introduce a new ExtractedTextProvider which can provide extracted text for
> a given Blob instance
> # In oak-run introduce a new indexer mode - this would take a path in the
> repository, traverse the repository looking for existing binaries, and extract
> text from them
> So before or after migration is done one can run this oak-run tool to create
> a store which has the text already extracted. Then post startup we need to
> wire up the ExtractedTextProvider instance (which is backed by the BlobStore
> populated before) and the indexing logic can just get content from that. This
> would avoid performing expensive text extraction in the indexing thread.
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66
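
As a minimal sketch of the provider implementation mentioned in proposal item #2: it reads text stored by the oak-run writer as one UTF-8 file per binary under a configured directory. The class name {{FSBackedPreExtractedTextProvider}}, the flat one-file-per-binary layout and the use of {{Blob#getContentIdentity()}} as the lookup key are illustrative assumptions, not part of the proposal.

{code:java}
package org.apache.jackrabbit.oak.plugins.blob.datastore;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import javax.annotation.CheckForNull;

import org.apache.jackrabbit.oak.api.Blob;
import org.apache.jackrabbit.oak.plugins.index.fulltext.PreExtractedTextProvider;

/**
 * Illustrative sketch only: serves pre extracted text stored as one UTF-8
 * text file per binary under a configured directory. The class name, file
 * layout and use of the blob content identity as key are assumptions,
 * not a final design.
 */
public class FSBackedPreExtractedTextProvider implements PreExtractedTextProvider {

    private final File storeDir;

    public FSBackedPreExtractedTextProvider(File storeDir) {
        this.storeDir = storeDir;
    }

    @CheckForNull
    @Override
    public String getText(String propertyPath, Blob blob) {
        // Assumption: Blob#getContentIdentity() (or some other stable id)
        // is the key under which the oak-run writer stored the text
        String id = blob.getContentIdentity();
        if (id == null) {
            return null;
        }
        // Assumed flat layout: <storeDir>/<contentIdentity>.txt
        File textFile = new File(storeDir, id + ".txt");
        if (!textFile.exists()) {
            // Returning null lets the index editor fall back to normal
            // Tika based extraction for this binary
            return null;
        }
        try {
            return new String(Files.readAllBytes(textFile.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return null;
        }
    }
}
{code}

On the consumer side, LuceneIndexEditor (and similarly the Solr editor) would first ask the configured provider and only run Tika extraction when {{null}} is returned, so a missing or partial pre extracted store degrades gracefully to the current behaviour.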