[
https://issues.apache.org/jira/browse/OAK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055169#comment-16055169
]
Chetan Mehrotra commented on OAK-6370:
--
bq. I'm moving many NEW files into Oak; this tool presumes that the files are
already in Oak; so what is the safe way(s) to get them in without incurring the
cost of text extraction?
Pre-extraction support is mostly meant to reduce reindexing time. To speed up
incremental indexing, see OAK-2787. Another option here would be to make text
extraction part of asset processing. This depends on the application built on
top of Oak. For example, if there is an asset management system which does some
post-processing upon asset binary addition, then it can also do the text
extraction and create a node holding the plain text data, which gets indexed
quickly.
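As a rough illustration of the asset-processing option, the sketch below
extracts text once, when a binary is added, and stores it in a node next to the
binary so that indexing only has to read a plain string. This is not Oak or JCR
API: AssetStore, TextExtractor, the "extractedText" property name, the "/text"
child path and the in-memory map standing in for the repository are all
hypothetical. A real application would use its own post-processing hook and a
real parser such as Apache Tika.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a real parser (e.g. Apache Tika); not an Oak API.
interface TextExtractor {
    String extract(byte[] binary);
}

// Hypothetical in-memory model of "asset node + child node with plain text".
final class AssetStore {
    private final Map<String, Map<String, Object>> nodes = new HashMap<>();
    private final TextExtractor extractor;

    AssetStore(TextExtractor extractor) {
        this.extractor = extractor;
    }

    // Post-processing step run when an asset binary is added:
    // store the binary and, alongside it, a node holding the extracted text.
    void addAsset(String path, byte[] binary) {
        Map<String, Object> asset = new HashMap<>();
        asset.put("data", binary);
        nodes.put(path, asset);

        Map<String, Object> text = new HashMap<>();
        text.put("extractedText", extractor.extract(binary));
        nodes.put(path + "/text", text);
    }

    // What an indexer would read: a plain string, no binary parsing needed.
    // Returns null if no text node exists for the given asset path.
    String indexableText(String path) {
        Map<String, Object> text = nodes.get(path + "/text");
        return text == null ? null : (String) text.get("extractedText");
    }
}
```

The point of the design is that the expensive parsing happens once at ingestion
time, outside the indexing path.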
bq. Does this extract text from everything when you run it? Can you limit it to
parts of the JCR? Or by new/modified content?
It takes a "path" argument, so it can be limited to a part of the repository.
bq. Should this be run on the same machine that is running the Oak repo or on a
different machine w/ the paths mounted?
It need not be. For the first phase, i.e. CSV generation, it needs access to the
NodeStore. So if it's a SegmentNodeStore-based setup, then it needs to be run
from that machine (or its clone). For the text extraction phase, it only needs
access to the BlobStore.
bq. Would this need to be run on every publish instance? Or could you run this
once (ex. the CSV extraction) and re-purpose that to save time? (Is the time
saved meaningful? Or is it preferred to run this process once per oak instance?)
It works at the blobId level, and blobIds are stable. So run it once per
cluster. If the publish instances have more or less the same data, then you
need to run it for only one of them.
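To make the blobId point concrete, here is a minimal sketch of why one
extraction run can serve several instances: the extracted text is keyed by blob
ID rather than by node path or instance, so any instance that references the
same blob finds the text already present. PreExtractedTextCache and its methods
are hypothetical names for illustration only, not Oak's actual pre-extracted
text store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; not Oak's actual store implementation.
final class PreExtractedTextCache {
    private final Map<String, String> textByBlobId = new HashMap<>();
    private int extractions = 0;

    // Returns the cached text for the blob ID, running the (expensive)
    // extractor only when no text has been stored for that ID yet.
    String textFor(String blobId, Function<String, String> extractor) {
        String cached = textByBlobId.get(blobId);
        if (cached != null) {
            return cached;
        }
        extractions++;
        String text = extractor.apply(blobId);
        textByBlobId.put(blobId, text);
        return text;
    }

    // Number of times the extractor actually ran.
    int extractionCount() {
        return extractions;
    }
}
```

Two lookups for the same blob ID run the extractor once; only a blob ID not
seen before triggers another extraction.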
> Improve documentation for text pre-extraction
> -
>
> Key: OAK-6370
> URL: https://issues.apache.org/jira/browse/OAK-6370
> Project: Jackrabbit Oak
> Issue Type: Documentation
> Components: lucene, run
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Fix For: 1.8
>
>
> The docs for pre-extraction do not make things very clear. This should be
> improved.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)