[ 
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587738#comment-14587738
 ] 

Thomas Mueller commented on OAK-2892:
-------------------------------------

It might make sense to store and retrieve some more information. There should 
be a way to distinguish between "not extracted", "extraction failed", and 
"extraction returned no data". Maybe we should also store the content type 
(MIME type) if available (this sometimes doesn't match the file name suffix), 
the version and type of the text extraction tool used, and, when extraction 
failed, that fact and the reason for it. This is quite a rich set of 
information, so we should probably store the data in some other way, for 
example as nodes.
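
To make the list above concrete, here is a minimal sketch of what such a 
per-binary record could hold. All names (ExtractionStatus, ExtractionRecord, 
the factory methods) are hypothetical and are not existing Oak API; how the 
fields would map to nodes or properties is left open.

```java
// Hypothetical names throughout; none of this is existing Oak API.
enum ExtractionStatus {
    NOT_EXTRACTED, // no extraction has been attempted for this binary
    FAILED,        // extraction was attempted and failed
    EMPTY,         // extraction succeeded but returned no data
    SUCCESS        // extraction succeeded and returned text
}

final class ExtractionRecord {
    final ExtractionStatus status;
    final String text;             // null unless status is SUCCESS
    final String mimeType;         // null if the content type is unknown
    final String extractorName;    // type of the text extraction tool
    final String extractorVersion; // version of the text extraction tool
    final String failureReason;    // null unless status is FAILED

    private ExtractionRecord(ExtractionStatus status, String text, String mimeType,
                             String extractorName, String extractorVersion,
                             String failureReason) {
        this.status = status;
        this.text = text;
        this.mimeType = mimeType;
        this.extractorName = extractorName;
        this.extractorVersion = extractorVersion;
        this.failureReason = failureReason;
    }

    /** Marker for a binary that has not been processed at all. */
    static ExtractionRecord notExtracted() {
        return new ExtractionRecord(ExtractionStatus.NOT_EXTRACTED,
                null, null, null, null, null);
    }

    /** Extraction ran without error; null or blank text is recorded as EMPTY. */
    static ExtractionRecord extracted(String text, String mimeType,
                                      String extractorName, String extractorVersion) {
        boolean empty = text == null || text.trim().isEmpty();
        return new ExtractionRecord(
                empty ? ExtractionStatus.EMPTY : ExtractionStatus.SUCCESS,
                empty ? null : text, mimeType, extractorName, extractorVersion, null);
    }

    /** Extraction ran and failed; the reason is kept for later inspection. */
    static ExtractionRecord failed(String reason, String mimeType,
                                   String extractorName, String extractorVersion) {
        return new ExtractionRecord(ExtractionStatus.FAILED,
                null, mimeType, extractorName, extractorVersion, reason);
    }
}
```

With an explicit status, a lookup that finds nothing can be told apart from 
one that finds a stored failure or a stored empty result, so failed binaries 
need not be retried on every reindex.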

I would probably store the data in an async index. This would be a new type 
of index, similar to the counter index.



> Speed up lucene indexing post migration by pre extracting the text content 
> from binaries
> ----------------------------------------------------------------------------------------
>
>                 Key: OAK-2892
>                 URL: https://issues.apache.org/jira/browse/OAK-2892
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.3.1, 1.0.16
>
>
> While migrating large repositories, say with 3 M docs (250k PDFs), Lucene 
> indexing takes a long time to complete (at times 4 days!). Currently the 
> text extraction logic is coupled with Lucene indexing and hence is performed 
> in single-threaded mode, which slows down the indexing process. Further, if 
> reindexing has to be triggered, it has to be done all over again.
> To speed up Lucene indexing we can decouple the text extraction
> from the actual indexing. This is partly based on the discussion in OAK-2787.
> # Introduce a new ExtractedTextProvider which can provide extracted text for 
> a given Blob instance
> # In oak-run introduce a new indexer mode. This would take a path in the 
> repository, traverse the repository from there, look for existing 
> binaries, and extract text from them
> So before or after migration is done one can run this oak-run tool to create 
> a store which has the text already extracted. Then post startup we need to 
> wire up the ExtractedTextProvider instance (which is backed by the BlobStore 
> populated before) and the indexing logic can just get content from that. 
> This would avoid performing expensive text extraction in the indexing thread.
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66
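
A rough sketch of how the indexing side might consult pre-extracted text 
first and only fall back to inline extraction on a miss. The issue names 
ExtractedTextProvider but does not define its methods, so the signature below 
is an assumption, as are MapBackedTextProvider, IndexingTextSource, and the 
use of a plain blob id string as the lookup key. An in-memory map stands in 
for the store populated by the oak-run tool.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Assumed shape: the issue only names this type, it does not specify methods.
interface ExtractedTextProvider {
    /** Returns the pre-extracted text for the blob id, or null if none is stored. */
    String getText(String blobId);
}

// Hypothetical in-memory stand-in for a provider backed by a pre-populated store.
final class MapBackedTextProvider implements ExtractedTextProvider {
    private final Map<String, String> store = new HashMap<>();

    void put(String blobId, String text) {
        store.put(blobId, text);
    }

    @Override
    public String getText(String blobId) {
        return store.get(blobId);
    }
}

// Hypothetical helper used by the indexing thread to obtain text for a binary.
final class IndexingTextSource {
    private final ExtractedTextProvider provider;
    private final Function<String, String> inlineExtractor; // expensive path
    private int inlineExtractions = 0;

    IndexingTextSource(ExtractedTextProvider provider,
                       Function<String, String> inlineExtractor) {
        this.provider = provider;
        this.inlineExtractor = inlineExtractor;
    }

    /** Prefers pre-extracted text; extracts inline only when nothing is stored. */
    String textFor(String blobId) {
        String text = provider.getText(blobId);
        if (text != null) {
            return text;
        }
        inlineExtractions++;
        return inlineExtractor.apply(blobId);
    }

    /** Number of times the expensive inline path was taken. */
    int inlineExtractions() {
        return inlineExtractions;
    }
}
```

Counting the inline extractions gives a simple way to check how much of the 
repository the pre-extraction run actually covered.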



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
