[ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585526#comment-14585526 ]

Chetan Mehrotra commented on OAK-2892:
--------------------------------------

There are 3 parts to this work:

# Scan binaries and pre extract text - this part is dealt with in the OAK-2953 
sub task
# Persisting extracted text - There can be multiple ways to store the extracted 
text. For the migration case we can go for a simple solution where the extracted 
text is saved as a file on the FS, similar to FileDataStore (a rough sketch 
follows the list below).
# Read from pre extracted text
{code:java}
package org.apache.jackrabbit.oak.plugins.index.fulltext;

import javax.annotation.CheckForNull;

import org.apache.jackrabbit.oak.api.Blob;

public interface PreExtractedTextProvider {

    /**
     * Get pre extracted text for a given blob
     *
     * @param propertyPath path of the binary property
     * @param blob binary property value
     *
     * @return pre extracted text, or null if no pre extracted
     * text was found
     */
    @CheckForNull
    String getText(String propertyPath, Blob blob);
}
{code}
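
To make #2 concrete, below is a minimal sketch of how the writer side could look. The class name, package and file layout are illustrative and not final: the oak-run extraction pass would write one text file per binary, keyed by the blob's content identity, using a FileDataStore like two level directory layout.
{code:java}
package org.apache.jackrabbit.oak.plugins.blob.datastore;

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;

/**
 * Illustrative writer used by the oak-run extraction pass (#1): stores the
 * extracted text of each binary as a plain text file, keyed by the blob's
 * content identity, in a two level directory layout similar to FileDataStore.
 */
public class ExtractedTextWriter {
    private final File textDir;

    public ExtractedTextWriter(File textDir) {
        this.textDir = textDir;
    }

    public void write(String contentIdentity, String extractedText) throws IOException {
        // e.g. <textDir>/ab/abcdef1234....txt (assumes identity length >= 2)
        File textFile = new File(new File(textDir, contentIdentity.substring(0, 2)),
                contentIdentity + ".txt");
        // writeStringToFile creates the parent directories if required
        FileUtils.writeStringToFile(textFile, extractedText, "UTF-8");
    }
}
{code}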

Of these, #1 can be made part of oak-run and would rely on #2, while #3 would be 
used by oak-lucene and oak-solr and would also rely on #2. So the proposal is:

# Add the interface from #3 under {{org.apache.jackrabbit.oak.plugins.index.fulltext}}
# Provide an implementation of the above (and also a writer to be used by oak-run) 
under {{org.apache.jackrabbit.oak.plugins.blob.datastore}}. The implementation 
can be enabled via OSGi config (see the sketch after this list)
# Have LuceneIndexEditor and SolrEditor use the interface from #1 to check 
whether the text has already been extracted
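
For #2 of the proposal, a possible shape of the provider implementation is sketched below. This is again illustrative: the class name and the OSGi wiring are open, the idea is only that it gets registered as an OSGi component with the text directory as a config property and looks up the files written earlier by oak-run. The editors in #3 would null-check the result and fall back to the normal inline extraction when no pre extracted text is found.
{code:java}
package org.apache.jackrabbit.oak.plugins.blob.datastore;

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.jackrabbit.oak.api.Blob;
import org.apache.jackrabbit.oak.plugins.index.fulltext.PreExtractedTextProvider;

/**
 * Illustrative FS backed provider. Would be registered as an OSGi component
 * with the storage directory exposed as a config property.
 */
public class DataStoreTextProvider implements PreExtractedTextProvider {
    private final File textDir;

    public DataStoreTextProvider(File textDir) {
        this.textDir = textDir;
    }

    @Override
    public String getText(String propertyPath, Blob blob) {
        String id = blob.getContentIdentity();
        if (id == null) {
            // inlined/small binaries have no stable identity - extract inline as before
            return null;
        }
        // same layout as the writer: <textDir>/ab/abcdef1234....txt
        File textFile = new File(new File(textDir, id.substring(0, 2)), id + ".txt");
        if (!textFile.exists()) {
            return null;
        }
        try {
            return FileUtils.readFileToString(textFile, "UTF-8");
        } catch (IOException e) {
            // treat as not found; the editor falls back to inline extraction
            return null;
        }
    }
}
{code}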

[~tmueller] [~alex.parvulescu] [~teofili] Thoughts on the above?

> Speed up lucene indexing post migration by pre extracting the text content 
> from binaries
> ----------------------------------------------------------------------------------------
>
>                 Key: OAK-2892
>                 URL: https://issues.apache.org/jira/browse/OAK-2892
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.3.1, 1.0.16
>
>
> While migrating large repositories, say with 3 M docs (250k PDFs), Lucene 
> indexing takes a long time to complete (at times 4 days!). Currently the text 
> extraction logic is coupled with Lucene indexing and hence is performed in 
> single threaded mode, which slows down the indexing process. Further, if 
> reindexing has to be triggered, the extraction has to be done all over again.
> To speed up the Lucene indexing we can decouple the text extraction
> from the actual indexing. This is partly based on the discussion in OAK-2787:
> # Introduce a new ExtractedTextProvider which can provide extracted text for 
> a given Blob instance
> # In oak-run introduce a new indexer mode - this would take a path in the 
> repository, traverse the repository looking for existing binaries, and 
> extract text from them
> So before or after the migration is done, one can run this oak-run tool to 
> create a store which has the text already extracted. Then, post startup, we 
> need to wire up the ExtractedTextProvider instance (which is backed by the 
> BlobStore populated before) and the indexing logic can just get the content 
> from it. This would avoid performing expensive text extraction in the 
> indexing thread.
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66


