[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243573#comment-16243573
 ] 

Thomas Mueller commented on OAK-5519:
-------------------------------------

My current approach: extract larger binaries in a separate thread (using an 
ExecutorService in ExtractedTextCache). Small binaries (less than 16 KB) are 
still extracted in the regular thread. If extraction takes longer than the 
timeout (currently 1 minute), the binary is ignored and indexing continues.
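
A minimal sketch of this approach, for illustration only (this is not the 
actual patch). The class name, method names and the thread name are 
hypothetical; only the 16 KB limit and the use of an ExecutorService with a 
timeout come from the description above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedExtractionSketch {
    // Binaries smaller than this are extracted in the calling thread.
    static final long INLINE_LIMIT_BYTES = 16 * 1024;

    // Daemon threads, so a stuck extraction does not prevent JVM shutdown.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "text-extraction (hypothetical name)");
        t.setDaemon(true);
        return t;
    });

    private final long timeoutMillis;

    TimedExtractionSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns the extracted text, or null if extraction timed out or failed. */
    String extract(long binaryLength, Callable<String> extractor) {
        try {
            if (binaryLength < INLINE_LIMIT_BYTES) {
                return extractor.call();
            }
            Future<String> future = executor.submit(extractor);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Give up on this binary; the worker thread keeps running.
                return null;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (Exception e) {
            // Extraction failed; treat it like a skipped binary.
            return null;
        }
    }
}
```

With a timeout of 60000 ms this matches the current default; a caller would 
treat a null result as "no text available for this binary".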

Current behavior:
* When extracting a binary takes very long (or extraction runs into an 
endless loop), the extraction thread keeps running, but indexing isn't blocked.
* The extraction thread has a "nice" thread name (includes the path of the 
node, binary,...).
* The process can be stopped normally as the extraction thread is a daemon 
thread.
* When restarting the process, extraction of that binary is _not_ retried.
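
As an illustration of the thread naming, a task can rename the thread it 
runs on for the duration of the extraction and restore the old name 
afterwards. This is only a sketch; the helper name and the name format are 
assumptions, not what the patch does.

```java
import java.util.function.Supplier;

class NamedTaskSketch {
    /**
     * Runs the task with the current thread temporarily renamed, so that a
     * thread dump shows which binary is being processed.
     */
    static <T> T runNamed(String threadName, Supplier<T> task) {
        Thread current = Thread.currentThread();
        String oldName = current.getName();
        current.setName(threadName);
        try {
            return task.get();
        } finally {
            // Restore the name so pooled threads are not mislabeled later.
            current.setName(oldName);
        }
    }
}
```

A call could look like runNamed("text extraction of " + path, task), where 
path is the (hypothetical) variable holding the path of the node.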

Open points:
* Should log a warning / error that text extraction failed for this binary.
* Add JMX support to detect runaway text extraction threads (in order to 
restart the process or manually stop those threads). 
* Default values should be configurable.
* On an endless loop in extraction, there are currently two threads consuming 
100% CPU each, instead of just one. Need to investigate (looks like there are 
two caches, which sounds wrong).
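
For the JMX point, one possible starting place is the platform 
ThreadMXBean, which can list live threads by name. The sketch below assumes 
that the extraction threads share a known name prefix; the class and method 
names are hypothetical, and it only finds candidate threads, it does not 
stop them.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

class RunawayThreadSketch {
    /** Returns the names of all live threads whose name starts with the prefix. */
    static List<String> findThreads(String namePrefix) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> result = new ArrayList<>();
        // getThreadInfo may return null entries for threads that just ended.
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null && info.getThreadName().startsWith(namePrefix)) {
                result.add(info.getThreadName());
            }
        }
        return result;
    }
}
```

Where supported, ThreadMXBean.getThreadCpuTime(id) could additionally be 
used to tell threads that are burning CPU from ones that are merely waiting.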

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>            Assignee: Thomas Mueller
>              Labels: resilience
>             Fix For: 1.8
>
>
> If a text extraction is blocked (weird PDF) or a blob cannot be found in the 
> datastore or any other error upon indexing one item from the repository that 
> is outside the scope of the indexer, it currently halts the indexing (lane). 
> Thus one item (that maybe isn't important to the users at all) can block the 
> indexing of other, new content (that might be important to users), and it 
> always requires manual intervention (which is also not easy and requires Oak 
> experts).
> Instead, the item could be remembered in a known issue list, proper warnings 
> given, and indexing continue. Maintenance operations should be available to 
> come back to reindex these, or the indexer could automatically retry after 
> some time. This would allow normal user activity to go on without manual 
> intervention, and solving the problem (if it's isolated to some binaries) can 
> be deferred.
> I think the line should probably be drawn for binary properties. Not sure if 
> other JCR property types could trigger a similar issue, and if a failure in 
> them might actually warrant a halt, as it could lead to an "incorrect" index, 
> if these properties are important. But maybe the line is simply a try & catch 
> around "full text extraction".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
