[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244076#comment-16244076
 ] 

Thomas Mueller commented on OAK-5519:
-------------------------------------

> Going forward we can probably store some hidden property to mark such 
> binaries to avoid hitting them again (as cache is ephemeral)

That's what I thought as well, but actually, I think this is not needed. When 
a bad PDF is added, text extraction runs, times out, and the text 
"TextExtractionError" is stored in the fulltext index. Indexing continues. The 
extraction thread will continue to consume 100% CPU until the process is 
killed or the thread is stopped. However, after a restart, Oak will not try 
to extract the same binary again, because indexing has already moved past it. 
The exception is if you upload the same binary somewhere else, but I guess 
that's rare.
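
To illustrate the behaviour described above, here is a minimal sketch in plain 
Java. It is not Oak's actual code: the class name, the method extractOrMarker, 
and the thread pool are hypothetical, and mapping non-timeout failures to the 
same marker is an assumption. Only the marker text "TextExtractionError" comes 
from this comment.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not Oak's implementation.
public class ExtractionTimeoutSketch {

    // Marker text indexed in place of the extracted text.
    static final String ERROR_MARKER = "TextExtractionError";

    // Daemon threads, so a runaway extraction cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "text-extraction-sketch");
        t.setDaemon(true);
        return t;
    });

    // Runs the extractor with a time limit and returns either the
    // extracted text or the marker, so the caller can always continue.
    static String extractOrMarker(Callable<String> extractor, long timeoutMillis) {
        Future<String> future = POOL.submit(extractor);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker thread, but a
            // parser stuck in a CPU-bound loop that never checks the
            // interrupt flag keeps running and keeps using CPU.
            future.cancel(true);
            return ERROR_MARKER;
        } catch (ExecutionException e) {
            // Assumption: other extraction failures get the same marker.
            return ERROR_MARKER;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return ERROR_MARKER;
        }
    }

    public static void main(String[] args) {
        // Well-behaved binary: the extracted text is returned.
        System.out.println(extractOrMarker(() -> "hello world", 1000));
        // Bad binary: the extractor never returns, so the marker is used.
        System.out.println(extractOrMarker(() -> { while (true) { } }, 100));
    }
}
```

The second call in main shows the trade-off: the caller gets the marker after 
the timeout and can continue, while the stuck worker thread keeps spinning 
until the JVM exits.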

> We can possibly store some more data/marker in special field which can then 
> later be queried to find out all such files which have not been indexed

Well, as you wrote, using the following query I can get the list of binaries 
where extraction failed:
{noformat}
/jcr:root//*[jcr:contains(., 'textextractionerror')] 
{noformat}

Of course this also matches binaries whose own text contains this exact term, 
but I don't think that's a big problem.

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>            Assignee: Thomas Mueller
>              Labels: resilience
>             Fix For: 1.8
>
>
> If a text extraction is blocked (weird PDF) or a blob cannot be found in the 
> datastore or any other error upon indexing one item from the repository that 
> is outside the scope of the indexer, it currently halts the indexing (lane). 
> Thus one item (that maybe isn't important to the users at all) can block the 
> indexing of other, new content (that might be important to users), and it 
> always requires manual intervention  (which is also not easy and requires oak 
> experts).
> Instead, the item could be remembered in a known issue list, proper warnings 
> given, and indexing continue. Maintenance operations should be available to 
> come back to reindex these, or the indexer could automatically retry after 
> some time. This would allow normal user activity to go on without manual 
> intervention, and solving the problem (if it's isolated to some binaries) can 
> be deferred.
> I think the line should probably be drawn for binary properties. Not sure if 
> other JCR property types could trigger a similar issue, and if a failure in 
> them might actually warrant a halt, as it could lead to an "incorrect" index, 
> if these properties are important. But maybe the line is simply a try & catch 
> around "full text extraction".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)