[jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

Chetan Mehrotra (JIRA) Tue, 14 Jul 2015 23:44:24 -0700

    [ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627615#comment-14627615 ]


Chetan Mehrotra commented on OAK-2892:
--------------------------------------

[~reschke] I will follow up on the test case failure in OAK-3102. That test 
is not related to this feature.

> Speed up lucene indexing post migration by pre extracting the text content 
> from binaries
> ----------------------------------------------------------------------------------------
>
>                 Key: OAK-2892
>                 URL: https://issues.apache.org/jira/browse/OAK-2892
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.2.3, 1.3.3, 1.0.18
>
>
> While migrating large repositories, say with 3 M docs (250k PDFs), Lucene 
> indexing takes a long time to complete (at times 4 days!). Currently the 
> text extraction logic is coupled with Lucene indexing and hence runs in 
> single-threaded mode, which slows down the indexing process. Further, if 
> reindexing has to be triggered, the extraction has to be done all over again.
> To speed up the Lucene indexing we can decouple the text extraction
> from the actual indexing. This is partly based on the discussion in OAK-2787.
> # Introduce a new ExtractedTextProvider which can provide extracted text for 
> a given Blob instance
> # In oak-run, introduce a new indexer mode. It would take a path in the 
> repository, traverse the repository, look for existing binaries, and 
> extract text from them
> So before or after the migration is done, one can run this oak-run tool to 
> create a store which holds the already extracted text. Then, post startup, 
> we need to wire up the ExtractedTextProvider instance (backed by the 
> BlobStore populated earlier) so that the indexing logic can just get the 
> content from it. This avoids performing expensive text extraction in the 
> indexing thread.
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66
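
The decoupling described above can be illustrated with a minimal Java sketch.
It is not Oak's actual API: the ExtractedTextProvider interface shape, the
preExtract and textForIndexing helpers, the in-memory map standing in for the
BlobStore-backed store, and the fallback to inline extraction are all
assumptions made for illustration. The extractor is a placeholder for a real
text parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PreExtractionSketch {

    /** Hypothetical stand-in for the proposed provider; not Oak's real interface. */
    interface ExtractedTextProvider {
        /** Returns the pre-extracted text for a blob id, or null if none was stored. */
        String getText(String blobId);
    }

    /**
     * Offline step (the role of the oak-run mode): extract text from every
     * binary on a thread pool and store the result keyed by blob id.
     * The extractor must not return null (ConcurrentHashMap rejects null values).
     */
    static Map<String, String> preExtract(Map<String, byte[]> binaries,
                                          Function<byte[], String> extractor,
                                          int threads) throws Exception {
        Map<String, String> store = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map.Entry<String, byte[]> e : binaries.entrySet()) {
                pending.add(pool.submit(() -> {
                    store.put(e.getKey(), extractor.apply(e.getValue()));
                }));
            }
            // Wait for all extractions; get() rethrows any extraction failure.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return store;
    }

    /**
     * Indexing-time step: use the stored text when present. Falling back to
     * inline extraction for unknown blobs is an assumption of this sketch.
     */
    static String textForIndexing(String blobId, byte[] content,
                                  ExtractedTextProvider provider,
                                  Function<byte[], String> extractor) {
        String text = provider.getText(blobId);
        return text != null ? text : extractor.apply(content);
    }

    public static void main(String[] args) throws Exception {
        // Toy extractor: decodes bytes as UTF-8 instead of parsing a real PDF.
        Function<byte[], String> extractor =
                bytes -> new String(bytes, StandardCharsets.UTF_8);

        Map<String, byte[]> binaries = new HashMap<>();
        binaries.put("blob-1", "first document".getBytes(StandardCharsets.UTF_8));
        binaries.put("blob-2", "second document".getBytes(StandardCharsets.UTF_8));

        Map<String, String> store = preExtract(binaries, extractor, 4);
        ExtractedTextProvider provider = store::get;

        // The indexing thread now performs a lookup instead of an extraction.
        System.out.println(textForIndexing("blob-1", binaries.get("blob-1"),
                provider, extractor));
    }
}
```

Because the store is keyed by blob id, a later reindex could reuse it as long
as the binaries themselves have not changed, which is the point of running the
extraction once, outside the indexing thread.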



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
