[ https://issues.apache.org/jira/browse/OAK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054133#comment-16054133 ]
David Gonzalez commented on OAK-6370:
-------------------------------------

I think that clarifies the mechanics of the tool, but I think clarifications on its expected use cases are warranted:

1) I'm moving many NEW files into Oak. This tool presumes that the files are already in Oak, so what is the safe way (or ways) to get them in without incurring the cost of text extraction?

1a) How is 1 handled in the context of a brand-new Oak repo that is not yet in use (i.e., you have more leeway in configuring/disabling features)?

1b) How is 1 handled if you're migrating large sets into a repo that is in use (i.e., you can't turn off text extraction wholesale, because someone might be uploading something in the normal course of work and expect it to be indexed right away)?

2) Does this extract text from *everything* when you run it? Can you limit it to parts of the JCR? Or to new/modified content?

3) Should this be run on the same machine that is running the Oak repo, or on a different machine with the paths mounted?

4) Would this need to be run on every publish instance? Or could you run this once (e.g., the CSV extraction) and re-purpose the result to save time? (Is the time saved meaningful? Or is it preferred to run this process once per Oak instance?)

> Improve documentation for text pre-extraction
> ---------------------------------------------
>
>                 Key: OAK-6370
>                 URL: https://issues.apache.org/jira/browse/OAK-6370
>             Project: Jackrabbit Oak
>          Issue Type: Documentation
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.8
>
>
> The docs for pre-extraction do not make things very clear. This should be
> improved

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)