[ 
https://issues.apache.org/jira/browse/OAK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054133#comment-16054133
 ] 

David Gonzalez commented on OAK-6370:
-------------------------------------

I think that clarifies the mechanics of the tool, but I think clarification of
its expected use cases is warranted:

1) I'm moving many NEW files into Oak. This tool presumes that the files are
already in Oak, so what is the safe way to get them in without incurring the
cost of text extraction?
1a) How is 1 handled in the context of a brand-new Oak repo that is not yet in
use (i.e. you have more leeway to configure or disable features)?
1b) How is 1 handled if you're migrating large sets into a repo that is in use
(i.e. you can't turn off text extraction wholesale, because someone might be
uploading something as part of normal work and expect it to be indexed right
away)?

2) Does this extract text from *everything* when you run it? Can you limit it 
to parts of the JCR? Or by new/modified content?

3) Should this be run on the same machine that is running the Oak repo, or on a
different machine with the paths mounted?

4) Would this need to be run on every publish instance? Or could you run this
once (e.g. the CSV extraction) and reuse that output to save time? (Is the time
saved meaningful? Or is it preferred to run this process once per Oak instance?)

> Improve documentation for text pre-extraction
> ---------------------------------------------
>
>                 Key: OAK-6370
>                 URL: https://issues.apache.org/jira/browse/OAK-6370
>             Project: Jackrabbit Oak
>          Issue Type: Documentation
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.8
>
>
> The docs for pre-extraction do not make things very clear. This should be
> improved.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
