[jira] [Commented] (OAK-6370) Improve documentation for text pre-extraction

2017-06-19 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055169#comment-16055169
 ] 

Chetan Mehrotra commented on OAK-6370:
--

bq.  I'm moving many NEW files into Oak; this tool presumes that the files are 
already in Oak; so what is the safe way(s) to get them in without incurring the 
cost of text extraction? 

Pre-extraction support is mostly meant to reduce reindexing time. For speeding 
up incremental indexing, see OAK-2787. Another option here would be to make 
text extraction part of asset processing. This depends on the application built 
on top of Oak. For example, if an asset management system does some 
post-processing when an asset binary is added, it can also do the text 
extraction and create a node holding the plain text, which gets indexed quickly.
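
A minimal sketch of that asset-processing idea, using only the Java standard library. Everything named here is hypothetical: the in-memory map stands in for the repository, the "extractedText" child node and "text" property are made-up names, and extractText() merely decodes UTF-8 where a real application would call a parser such as Apache Tika.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class AssetPostProcessor {
    // Hypothetical stand-in for the repository: node path -> properties.
    private final Map<String, Map<String, String>> nodes = new HashMap<>();

    // Stand-in for a real extractor; this one only handles UTF-8 text.
    static String extractText(byte[] binary) {
        return new String(binary, StandardCharsets.UTF_8).trim();
    }

    // Called once when an asset binary is added: extract the text up front
    // and store it as plain text in a child node (assumed name).
    public void onAssetAdded(String assetPath, byte[] binary) {
        Map<String, String> props = new HashMap<>();
        props.put("text", extractText(binary));
        nodes.put(assetPath + "/extractedText", props);
    }

    // Returns the stored plain text, or null if the asset was never processed.
    public String storedText(String assetPath) {
        Map<String, String> props = nodes.get(assetPath + "/extractedText");
        return props == null ? null : props.get("text");
    }
}
```

The point of the design is that the indexer later reads a short plain-text property instead of parsing the binary.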

bq. Does this extract text from everything when you run it? Can you limit it to 
parts of the JCR? Or by new/modified content?

It takes a "path" argument, so it can be limited to part of the repository.

bq. Should this be run on the same machine that is running the Oak repo or on a 
different machine w/ the paths mounted?

Need not be. The first phase, i.e. CSV generation, needs access to the 
NodeStore. So if it's a SegmentNodeStore-based setup then it needs to be run 
on that machine (or its clone). The text extraction phase only needs access to 
the BlobStore.
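
The split between the two phases can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: both stores are plain in-memory maps, the CSV has a simplified "blobId,path" layout that is assumed here rather than taken from the tool, and "extraction" just returns the stored string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoPhaseSketch {
    // Phase 1: needs the node store (here a hypothetical path -> blobId map).
    // Only entries at or below rootPath are written to the CSV.
    static List<String> generateCsv(Map<String, String> nodeStore, String rootPath) {
        String prefix = rootPath.endsWith("/") ? rootPath : rootPath + "/";
        List<String> lines = new ArrayList<>();
        // TreeMap gives a deterministic, path-sorted iteration order.
        for (Map.Entry<String, String> e : new TreeMap<>(nodeStore).entrySet()) {
            String path = e.getKey();
            if (path.equals(rootPath) || path.startsWith(prefix)) {
                lines.add(e.getValue() + "," + path);
            }
        }
        return lines;
    }

    // Phase 2: needs only the CSV and the blob store (blobId -> content).
    // The node store is not touched here.
    static Map<String, String> extract(List<String> csv, Map<String, String> blobStore) {
        Map<String, String> textByBlobId = new TreeMap<>();
        for (String line : csv) {
            String blobId = line.substring(0, line.indexOf(','));
            textByBlobId.put(blobId, blobStore.get(blobId));
        }
        return textByBlobId;
    }
}
```

Because phase 2 takes only the CSV and the blob store as inputs, it can run on any machine that can reach the blobs.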

bq.  Would this need to be run on every publish instance? Or could you run this 
once (ex. the CSV extraction) and re-purpose that to save time? (Is the time 
saved meaningful? Or is it preferred to run this process once per oak instance?)

It works at the blobId level, and blobIds are stable. So run it once per 
cluster. If the publish instances have more or less the same data then you 
need to run it only for one of them.
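
Why stable blobIds allow reuse can be shown with a small sketch: extracted text is keyed by blobId, so the same binary is extracted once no matter how many nodes or instances reference it. The class and method names are hypothetical, and the extractor is passed in as a plain function rather than being a real parser.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class BlobTextCache {
    // Extracted text keyed by blobId; a stable id means a stable cache key.
    private final Map<String, String> textByBlobId = new HashMap<>();
    private int extractions = 0;

    // Returns cached text for the blobId, running the extractor only on a miss.
    public String getOrExtract(String blobId, Function<String, String> extractor) {
        return textByBlobId.computeIfAbsent(blobId, id -> {
            extractions++;
            return extractor.apply(id);
        });
    }

    // Number of times the (expensive) extractor actually ran.
    public int extractionCount() {
        return extractions;
    }
}
```

An instance that holds the same blobs can be handed the already-populated store and will then hit the cache instead of extracting again.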



> Improve documentation for text pre-extraction
> -
>
> Key: OAK-6370
> URL: https://issues.apache.org/jira/browse/OAK-6370
> Project: Jackrabbit Oak
>  Issue Type: Documentation
>  Components: lucene, run
>Reporter: Chetan Mehrotra
>Assignee: Chetan Mehrotra
> Fix For: 1.8
>
>
> The docs for pre-extraction do not make things very clear. This should be 
> improved



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (OAK-6370) Improve documentation for text pre-extraction

2017-06-19 Thread David Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054133#comment-16054133
 ] 

David Gonzalez commented on OAK-6370:
-

I think that clarifies the mechanics of the tool, but I think clarifications on 
its expected use cases are warranted:

1) I'm moving many NEW files into Oak; this tool presumes that the files are 
already in Oak; so what is the safe way(s) to get them in without incurring the 
cost of text extraction? 
1a) How is 1 handled in the context of a new Oak repo that is not yet in use 
(i.e. you have more leeway configuring/disabling features)?
1b) How is 1 handled if you're migrating large sets into a repo that is in use 
(i.e. you can't turn off text extraction wholesale because someone might be 
uploading something as normal and expect it to be indexed right away)?

2) Does this extract text from *everything* when you run it? Can you limit it 
to parts of the JCR? Or by new/modified content?

3) Should this be run on the same machine that is running the Oak repo or on a 
different machine w/ the paths mounted?

4) Would this need to be run on every publish instance? Or could you run this 
once (ex. the CSV extraction) and re-purpose that to save time? (Is the time 
saved meaningful? Or is it preferred to run this process once per oak instance?)
