[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244308#comment-16244308
 ] 

Chetan Mehrotra commented on OAK-5519:
--------------------------------------

bq. However, after a restart, Oak will not try to extract the same binary 
again, as indexing continued. Except if you upload the same binary to somewhere 
else, but I guess that's rare.

If that binary is getting indexed due to aggregation, then the same binary can 
be processed again whenever any other aggregated property is modified. For 
example, in an asset-like structure, even if the original binary is not touched 
but some metadata is updated, that would trigger reindexing of the same asset 
subtree, which in turn triggers text extraction again.

bq. Well, as you wrote, using the following query I can get the list of 
binaries where extraction failed:

With one caveat: a Lucene document may contain text extracted from multiple 
binaries in case of aggregation (not that big a concern in general, as the 
others are mostly derived binaries). So this query may flag all binaries under 
a given subtree as blacklisted. But to start with, this query is useful for 
cases where text extraction did not end up in an infinite loop.

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>            Assignee: Thomas Mueller
>              Labels: resilience
>             Fix For: 1.8
>
>
> If a text extraction is blocked (weird PDF) or a blob cannot be found in the 
> datastore or any other error upon indexing one item from the repository that 
> is outside the scope of the indexer, it currently halts the indexing (lane). 
> Thus one item (that maybe isn't important to the users at all) can block the 
> indexing of other, new content (that might be important to users), and it 
> always requires manual intervention (which is also not easy and requires Oak 
> experts).
> Instead, the item could be remembered in a known issue list, proper warnings 
> given, and indexing continue. Maintenance operations should be available to 
> come back to reindex these, or the indexer could automatically retry after 
> some time. This would allow normal user activity to go on without manual 
> intervention, and solving the problem (if it's isolated to some binaries) can 
> be deferred.
> I think the line should probably be drawn for binary properties. Not sure if 
> other JCR property types could trigger a similar issue, and if a failure in 
> them might actually warrant a halt, as it could lead to an "incorrect" index, 
> if these properties are important. But maybe the line is simply a try & catch 
> around "full text extraction".
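
The approach proposed in the description above (a try & catch around full text 
extraction, a known issue list, a warning, and indexing that continues) could 
look roughly like the following. This is only a minimal sketch, not Oak's 
implementation: the class SkippingTextIndexer, its methods, and the use of 
plain strings as blob ids are all hypothetical and exist purely for 
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; none of these names are Oak APIs. */
class SkippingTextIndexer {
    // Stand-in for a text extractor; may throw on a problematic binary.
    private final Function<String, String> extractor;
    // "Known issue list": blob id -> failure message, kept for a later retry.
    private final Map<String, String> knownIssues = new LinkedHashMap<>();

    SkippingTextIndexer(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    /** Indexes every blob; a failing blob is recorded and skipped, not fatal. */
    List<String> indexAll(List<String> blobIds) {
        List<String> indexedText = new ArrayList<>();
        for (String blobId : blobIds) {
            try {
                indexedText.add(extractor.apply(blobId));
            } catch (RuntimeException e) {
                // Remember the item and warn, then continue with the next blob.
                knownIssues.put(blobId, String.valueOf(e.getMessage()));
                System.err.println("WARN: skipping text extraction for "
                        + blobId + ": " + e.getMessage());
            }
        }
        return indexedText;
    }

    /** Blobs that failed extraction, in the order they were encountered. */
    Map<String, String> knownIssues() {
        return knownIssues;
    }
}
```

Note that a try & catch only covers extractions that fail with an exception. 
An extraction that blocks or loops forever would additionally need some form 
of timeout, which this sketch does not attempt.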



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)