[jira] [Comment Edited] (OAK-5519) Skip problematic binaries instead of blocking indexing

Thomas Mueller (JIRA) Thu, 09 Nov 2017 09:09:01 -0800

    [ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246063#comment-16246063
 ]


Thomas Mueller edited comment on OAK-5519 at 11/9/17 5:07 PM:
--------------------------------------------------------------

http://svn.apache.org/r1814745

[~chetanm] I have incorporated your requests. Features:
* No OSGi / JMX configuration right now, but "emergency" configuration via 
system properties (for example, to disable this feature or to set the 
timeout).
* Timeout is 60 seconds.
* Binaries whose extraction timed out are now recorded in a properties file 
named "textExtractionTimeout.properties" in the repository / index directory. 
Example content below. This file is read on startup and kept fully in memory, 
so better not to rely on this mechanism if the file gets large.
{noformat}
#Text extraction timed out for the following binaries, and will not be retried
#Thu Nov 09 12:33:52 CET 2017
405dfb76526462a6268f1aacb09359179216df423c474b3a1f578b9c567faa35\#190148=TextExtractionError
d19a28de09b655dbe099ee9e72e5bc782088994cca054062213d80b22f2ac67f\#1757777=TextExtractionError
251c6082691578dc1aff306a59984e1b80a79befd8465e158335c5cbfe8bb596\#399142=TextExtractionError
{noformat}
* Failed extraction is cached.
* The number of extractions that timed out can be read via JMX 
(TextExtractionStatsMBean.getTimeoutCount). Each timed-out extraction thread 
can keep consuming 100% CPU (unless it stops at some point).
* Extraction uses its own executor service with daemon threads. The executor 
is shut down when the service stops, and restarted when needed. Usually just 
one thread, up to 10 (configurable), so worst case up to 900% CPU usage if 9 
extractions time out.
* Thread name is "oak binary text extractor" plus the name of the extracted 
blob (similar to what it was before).
* Only binaries larger than 16 KB are extracted in a separate thread.
* A warning is logged if extraction times out.
* No change for OutOfMemoryError and similar failures (Throwable was already 
caught before this patch), so this patch only affects timeouts.
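
A minimal Java sketch of the mechanism described in the list above, to show how 
the pieces fit together. It is an illustration under assumptions, not the code 
committed in r1814745: the class name TimeoutExtractionSketch, its methods, and 
the fixed-size thread pool are made up for this example. Only the thread name 
prefix, the "TextExtractionError" marker, and the properties file format are 
taken from the description above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; names and structure are assumptions, not Oak's API. */
final class TimeoutExtractionSketch {
    static final String ERROR_MARKER = "TextExtractionError";

    // Blob IDs whose extraction timed out, kept fully in memory.
    private final Properties timedOut = new Properties();
    private final AtomicInteger timeoutCount = new AtomicInteger();
    private final long timeoutMillis;
    private final ExecutorService executor;

    TimeoutExtractionSketch(long timeoutMillis, int maxThreads) {
        this.timeoutMillis = timeoutMillis;
        // Daemon threads, so a stuck extraction cannot keep the JVM alive.
        this.executor = Executors.newFixedThreadPool(maxThreads, r -> {
            Thread t = new Thread(r, "oak binary text extractor");
            t.setDaemon(true);
            return t;
        });
    }

    /** Returns the extracted text, or ERROR_MARKER if this blob timed out now or earlier. */
    String extract(String blobId, Callable<String> extraction)
            throws InterruptedException, ExecutionException {
        if (timedOut.containsKey(blobId)) {
            return ERROR_MARKER; // known problematic binary: not retried
        }
        Future<String> future = executor.submit(extraction);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: a thread that ignores interrupts keeps running.
            future.cancel(true);
            timedOut.setProperty(blobId, ERROR_MARKER);
            timeoutCount.incrementAndGet();
            return ERROR_MARKER;
        }
    }

    /** Serializes the timed-out blob IDs in java.util.Properties format. */
    String store() throws IOException {
        StringWriter w = new StringWriter();
        timedOut.store(w, "Text extraction timed out for the following binaries, "
                + "and will not be retried");
        return w.toString();
    }

    /** Loads previously stored blob IDs, for example on startup. */
    void load(Reader reader) throws IOException {
        timedOut.load(reader);
    }

    int getTimeoutCount() {
        return timeoutCount.get();
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

In this sketch, calling extract(blobId, task) with a task that runs longer than 
the timeout returns the error marker, and later calls for the same blob ID 
return it immediately without running the task. Properties.store escapes the 
"#" in keys as "\#", which is why the blob IDs appear that way in the example 
file above.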



> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>            Assignee: Thomas Mueller
>              Labels: resilience
>             Fix For: 1.8, 1.7.12
>
>
> If a text extraction is blocked (weird PDF), a blob cannot be found in the 
> datastore, or any other error outside the scope of the indexer occurs while 
> indexing one item from the repository, it currently halts the indexing 
> (lane). Thus one item (that maybe isn't important to the users at all) can 
> block the indexing of other, new content (that might be important to users), 
> and it always requires manual intervention (which is also not easy and 
> requires Oak experts).
> Instead, the item could be remembered in a known issue list, proper warnings 
> given, and indexing continue. Maintenance operations should be available to 
> come back to reindex these, or the indexer could automatically retry after 
> some time. This would allow normal user activity to go on without manual 
> intervention, and solving the problem (if it's isolated to some binaries) can 
> be deferred.
> I think the line should probably be drawn for binary properties. Not sure if 
> other JCR property types could trigger a similar issue, and if a failure in 
> them might actually warrant a halt, as it could lead to an "incorrect" index, 
> if these properties are important. But maybe the line is simply a try & catch 
> around "full text extraction".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
