[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-11-10 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247218#comment-16247218
 ] 

Thomas Mueller commented on OAK-5519:
-

[~jsedding] This only works if text extraction is reading, but in my case it's 
in an endless loop that doesn't read. Even Thread.interrupt() won't work in 
that case. However, you can try Thread.stop() if you are adventurous.
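
A minimal sketch of why {{Thread.interrupt()}} does not help here: interrupting only sets a flag, and a CPU-bound loop that never checks that flag (and never blocks) just keeps running. The class and method names are made up for illustration, and the escape flag exists only so the demo can terminate; a buggy parser has no such hook. Note also that {{Thread.stop()}} is deprecated.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class InterruptDemo {

    /**
     * Starts a busy loop that never checks the interrupt flag, interrupts it,
     * and reports whether the thread is still alive shortly afterwards.
     */
    static boolean survivesInterrupt() {
        // Escape hatch so this demo can end; hypothetical, not part of any parser.
        AtomicBoolean escape = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            long spins = 0;
            while (!escape.get()) {
                spins++; // busy work: no I/O, no sleep, no interrupt check
            }
        }, "busy-loop-demo");
        worker.setDaemon(true);
        worker.start();
        try {
            worker.interrupt();  // only sets the interrupt flag
            worker.join(200);    // give the loop time to react (it does not)
            boolean alive = worker.isAlive();
            escape.set(true);    // cooperative stop, possible only because we own the loop
            worker.join();
            return alive;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("still running after interrupt(): " + survivesInterrupt());
    }
}
```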

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>Assignee: Thomas Mueller
>  Labels: resilience
> Fix For: 1.8, 1.7.12
>
>
> If a text extraction is blocked (weird PDF) or a blob cannot be found in the 
> datastore or any other error upon indexing one item from the repository that 
> is outside the scope of the indexer, it currently halts the indexing (lane). 
> Thus one item (that maybe isn't important to the users at all) can block the 
> indexing of other, new content (that might be important to users), and it 
> always requires manual intervention  (which is also not easy and requires oak 
> experts).
> Instead, the item could be remembered in a known issue list, proper warnings 
> given, and indexing continue. Maintenance operations should be available to 
> come back to reindex these, or the indexer could automatically retry after 
> some time. This would allow normal user activity to go on without manual 
> intervention, and solving the problem (if it's isolated to some binaries) can 
> be deferred.
> I think the line should probably be drawn for binary properties. Not sure if 
> other JCR property types could trigger a similar issue, and if a failure in 
> them might actually warrant a halt, as it could lead to an "incorrect" index, 
> if these properties are important. But maybe the line is simply a try & catch 
> around "full text extraction".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-11-10 Thread Julian Sedding (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247189#comment-16247189
 ] 

Julian Sedding commented on OAK-5519:
-

[~tmueller] could the processing thread be terminated by closing the stream 
after the timeout? I suppose that should trip up the parser and cause an 
{{IOException}} on the next read. Granted, I don't understand the full 
background of this issue. Maybe the endless loop scenarios don't read from the 
stream any more.
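
A rough sketch of the idea above, under the assumption that the parser is still reading: once the stream has been closed, the next read fails with an {{IOException}}. The "parser" here is a stand-in loop, the close happens in the same thread to keep the example deterministic, and whether a read after close fails depends on the stream implementation ({{BufferedInputStream}} does throw; a bare {{ByteArrayInputStream}} ignores {{close()}}).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class CloseStreamDemo {

    /** Stand-in for a parser: reads until end of stream and counts the bytes. */
    static int consume(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    /** Returns true if reading after close() failed with an IOException. */
    static boolean readAfterCloseFails() {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[1024]));
        try {
            in.read();    // the parser has started reading
            in.close();   // what a watchdog would do after the timeout
            consume(in);  // the next read is expected to fail
            return false;
        } catch (IOException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("read after close failed: " + readAfterCloseFails());
    }
}
```

This does nothing for a loop that has stopped reading, which is the case described in the reply.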



[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-11-09 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246063#comment-16246063
 ] 

Thomas Mueller commented on OAK-5519:
-

http://svn.apache.org/r1814745

[~chetanm] I have incorporated your requests. Features:
* No OSGi / JMX configuration right now, but "emergency" configuration via 
system properties (for example, ability to disable this feature, set 
timeout,...)
* Timeout is 60 seconds.
* Timed out extraction is now stored to a file in the repository / index 
directory, in a properties file named "textExtractionTimeout.properties". 
Example content below. This file is read on startup.
{noformat}
#Text extraction timed out for the following binaries, and will not be retried
#Thu Nov 09 12:33:52 CET 2017
405dfb76526462a6268f1aacb09359179216df423c474b3a1f578b9c567faa35\#190148=TextExtractionError
d19a28de09b655dbe099ee9e72e5bc782088994cca054062213d80b22f2ac67f\#175=TextExtractionError
251c6082691578dc1aff306a59984e1b80a79befd8465e158335c5cbfe8bb596\#399142=TextExtractionError
{noformat}
* Failed extraction is cached.
* Number of extractions that timed out can be read via JMX 
(TextExtractionStatsMBean.getTimeoutCount). Each of those threads can consume 
100% CPU (unless they stop at some point).
* It is using its own executor service with daemon threads. This is shut down 
when stopping the service, and restarted when needed. Just one thread usually, 
up to 10 (configurable), so worst case up to 900% CPU usage if 9 extractions 
time out. 
* Thread name is "oak binary text extractor" plus the name of the extracted 
blob (similar to what it was before).
* Only binaries larger than 16 KB are extracted in a separate thread.
* A warning is logged if extraction times out.
* No change for OutOfMemory and so on (Throwable was already caught before this 
patch). So this patch only affects timeouts.
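
The example file content above is in standard {{java.util.Properties}} format, where {{\#}} in a key is an escaped {{#}}. As a small sketch (not the code from the commit), this is how such content could be read back into a set of identifiers; the class and method names are invented for illustration and the sample is embedded as a string instead of being read from the index directory.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

class TimeoutListDemo {

    // Two of the three entries from the example content in the comment.
    static final String SAMPLE =
        "#Text extraction timed out for the following binaries, and will not be retried\n"
        + "#Thu Nov 09 12:33:52 CET 2017\n"
        + "405dfb76526462a6268f1aacb09359179216df423c474b3a1f578b9c567faa35\\#190148=TextExtractionError\n"
        + "d19a28de09b655dbe099ee9e72e5bc782088994cca054062213d80b22f2ac67f\\#175=TextExtractionError\n";

    /** Parses properties-format content and returns the keys (the skipped binaries). */
    static Set<String> timedOutIds(String fileContent) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent)); // comment lines start with '#'
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return new TreeSet<>(props.stringPropertyNames());
    }

    public static void main(String[] args) {
        // Keys come back with a plain '#', since the backslash is only an escape.
        timedOutIds(SAMPLE).forEach(System.out::println);
    }
}
```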

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>Assignee: Thomas Mueller
>  Labels: resilience
> Fix For: 1.8
>
>


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-11-08 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244308#comment-16244308
 ] 

Chetan Mehrotra commented on OAK-5519:
--

bq. However, after a restart, Oak will not try to extract the same binary 
again, as indexing continued. Except if you upload the same binary to somewhere 
else, but I guess that's rare.

If that binary is indexed because of aggregation, then the same binary can be 
processed again whenever any other aggregated property is modified. For 
example, in an asset-like structure, even if the original binary is not touched 
but some metadata is updated, that would trigger reindexing of the same asset 
subtree, which again triggers text extraction.

bq. Well, as you wrote, using the following query I can get the list of 
binaries where extraction failed:

One caveat: a Lucene document may contain text extracted from multiple 
binaries in case of aggregation (not that big a concern in general, as the 
others are mostly derived binaries). So this query may flag all binaries under 
a given subtree as blacklisted. But to start with, this query is useful for 
cases where text extraction did not end up in an infinite loop.

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>Assignee: Thomas Mueller
>  Labels: resilience
> Fix For: 1.8
>
>


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-11-08 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244076#comment-16244076
 ] 

Thomas Mueller commented on OAK-5519:
-

> Going forward we can probably store some hidden property to mark such 
> binaries to avoid hitting them again (as cache is ephemeral)

That's what I thought as well, but actually, I think this is not needed. When 
adding a bad PDF, text extraction will run and then time out, and the text 
"TextExtractionError" is stored in the fulltext index. Indexing continues. The 
thread will continue to consume 100% CPU until the process is killed or the 
thread is stopped. However, after a restart, Oak will not try to extract the 
same binary again, as indexing continued. Except if you upload the same binary 
somewhere else, but I guess that's rare.

> We can possibly store some more data/marker in special field which can then 
> later be queried to find out all such files which have not been indexed

Well, as you wrote, using the following query I can get the list of binaries 
where extraction failed:
{noformat}
/jcr:root//*[jcr:contains(., 'textextractionerror')] 
{noformat}

Of course this includes binaries that contain this exact term, but I don't 
think that's a big problem.

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>Assignee: Thomas Mueller
>  Labels: resilience
> Fix For: 1.8
>
>


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-11-08 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243754#comment-16243754
 ] 

Chetan Mehrotra commented on OAK-5519:
--

bq.  the text extraction cache only puts results in the cache if extraction was 
successful. I wonder why that is, it seems failure should also be cached.

+1. Note that currently, if text extraction fails for a file, we store a 
sentinel value "TextExtractionError" to indicate that there was an error 
processing it. 

Thinking out loud: going forward we can probably store some hidden property to 
mark such binaries to avoid hitting them again (as the cache is ephemeral). 
However, this would be tricky, as IndexEditors currently do not have access to 
the NodeBuilder for that node. Maybe we can store it in the index data in some 
form (flat file?).

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>Assignee: Thomas Mueller
>  Labels: resilience
> Fix For: 1.8
>
>


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-11-08 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243702#comment-16243702
 ] 

Thomas Mueller commented on OAK-5519:
-

I found out why there are two threads consuming 100% each, and not just one: 
the text extraction cache only puts results in the cache if extraction was 
successful. I wonder why that is, it seems failure should also be cached. What 
do you think, [~chetanm], [~catholicon]?

{noformat}
public void put(@Nonnull Blob blob, @Nonnull ExtractedText extractedText) {
    String id = blob.getContentIdentity();
    if (extractedText.getExtractionResult() == ExtractedText.ExtractionResult.SUCCESS && ...) {
        cache.put(id, extractedText.getExtractedText().toString());
    }
}
{noformat}
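
For comparison, a self-contained sketch of what caching failures as well might look like. This is not the Oak code: the class, the enum, and the plain map are hypothetical stand-ins, and the only thing taken from the discussion is the "TextExtractionError" sentinel.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FailureCachingSketch {

    // Hypothetical stand-in for ExtractedText.ExtractionResult.
    enum Result { SUCCESS, ERROR }

    static final String ERROR_SENTINEL = "TextExtractionError";

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Caches the text on success and the sentinel on failure. */
    void put(String blobId, Result result, String text) {
        if (blobId == null) {
            return; // no stable identity, nothing to key the entry on
        }
        boolean ok = result == Result.SUCCESS && text != null;
        cache.put(blobId, ok ? text : ERROR_SENTINEL);
    }

    /** Returns the cached text, the sentinel, or null if the blob was never seen. */
    String get(String blobId) {
        return cache.get(blobId);
    }

    public static void main(String[] args) {
        FailureCachingSketch c = new FailureCachingSketch();
        c.put("blob-1", Result.ERROR, null);
        System.out.println(c.get("blob-1"));
    }
}
```

With the failure cached, a second request for the same binary returns the sentinel instead of starting another extraction.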

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>Assignee: Thomas Mueller
>  Labels: resilience
> Fix For: 1.8
>
>


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-11-08 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243573#comment-16243573
 ] 

Thomas Mueller commented on OAK-5519:
-

My current approach is: extract larger binaries using a separate thread 
(ExecutorService, in ExtractedTextCache). Small binaries (less than 16 KB) are 
still extracted in the regular thread. If extraction takes longer than the 
timeout (1 minute right now), then ignore this binary and continue.
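
The approach can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: the class name, the use of a cached thread pool, and the treatment of a binary of exactly 16 KB are all guesses, and the extraction is passed in as a plain {{Callable}}.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedExtractionSketch {

    static final String ERROR_SENTINEL = "TextExtractionError";
    static final long INLINE_LIMIT_BYTES = 16 * 1024;

    // Daemon threads, so a runaway extraction cannot keep the JVM from exiting.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "text-extractor-sketch");
        t.setDaemon(true);
        return t;
    });

    /** Runs small extractions inline and large ones in a separate thread with a timeout. */
    String extract(long sizeInBytes, Callable<String> extraction, long timeoutMillis) {
        try {
            if (sizeInBytes < INLINE_LIMIT_BYTES) {
                return extraction.call(); // regular (calling) thread
            }
            Future<String> future = executor.submit(extraction);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The worker thread keeps running; only the caller moves on.
            return ERROR_SENTINEL;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return ERROR_SENTINEL;
        } catch (Exception e) {
            return ERROR_SENTINEL; // the extraction itself failed
        }
    }

    public static void main(String[] args) {
        TimedExtractionSketch s = new TimedExtractionSketch();
        System.out.println(s.extract(1 << 20, () -> { Thread.sleep(5000); return "late"; }, 100));
    }
}
```

The timed-out task is deliberately not cancelled here, since cancellation relies on interruption and would not stop a loop that never checks for it.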

Current behavior:
* When trying to extract a binary that takes very long (or extraction has an 
endless loop), then the thread continues running, but extraction isn't blocked.
* The extraction thread has a "nice" thread name (includes the path of the 
node, binary,...).
* The process can be stopped normally as the extraction thread is a daemon 
thread.
* When restarting the process, extraction of that binary is _not_ retried.

Open points:
* Should log a warning / error that text extraction failed for this binary.
* Add JMX support to detect runaway text extraction threads (in order to 
restart the process or manually stop those threads). 
* Default values should be configurable.
* On an endless loop in extraction, there are currently two threads consuming 
100% CPU each, instead of just one. Need to investigate (looks like there are 
two caches, which sounds wrong).

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>Assignee: Thomas Mueller
>  Labels: resilience
> Fix For: 1.8
>
>


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-07-28 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104727#comment-16104727
 ] 

Chetan Mehrotra commented on OAK-5519:
--

bq.  it does nothing except throw an exception / error / out of memory error 
every time

Currently the 
[BinaryTextExtractor|https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/binary/BinaryTextExtractor.java#L161]
 catches Throwable while calling Tika (thinking about it, we should handle this 
better and distinguish serious errors). 

The case that is not handled is where the parser enters an infinite loop while 
processing, in which case the process needs to be killed. Maybe we could use an 
executor for text extraction, with the indexer thread submitting a job to it 
with some timeout. In such a case the executor pool thread would stay blocked, 
but the indexer thread can continue, and the system admin can look into 
restarting the process. This would allow us to detect such problematic binaries 
more reliably, and we only need to remember them in case of a timeout or some 
exception in processing.

Thoughts?

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>  Labels: resilience
> Fix For: 1.8
>
>


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-07-26 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101666#comment-16101666
 ] 

Thomas Mueller commented on OAK-5519:
-

[~catholicon] and [~chetanm] I think we should try the "Memory of bad file" 
solution, if that's simple. 

I assume we could write a test case first that uses a "custom" Tika config as 
documented in 
http://jackrabbit.apache.org/oak/docs/query/lucene.html#Tika_Config, custom in 
that it does nothing except throw an exception / error / out of memory error 
every time. Then check whether this runs into an endless loop. Then remember 
the file if it fails *three times* in a row. I think it would be better to wait 
for three failures, because the first one might be due to a non-repeatable 
problem (out of memory caused by another thread).
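
The "three times in a row" rule could be tracked with a small counter along these lines. It is only a sketch: the names are hypothetical, the state is kept in memory, and a real implementation would need to persist it to survive a restart.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ThreeStrikesSketch {

    static final int MAX_CONSECUTIVE_FAILURES = 3;

    private final Map<String, Integer> consecutiveFailures = new HashMap<>();
    private final Set<String> skipped = new HashSet<>();

    /** Counts a failure; the binary is remembered as bad after the third in a row. */
    void recordFailure(String blobId) {
        int count = consecutiveFailures.merge(blobId, 1, Integer::sum);
        if (count >= MAX_CONSECUTIVE_FAILURES) {
            skipped.add(blobId);
        }
    }

    /** A success resets the streak, so one-off problems are forgotten. */
    void recordSuccess(String blobId) {
        consecutiveFailures.remove(blobId);
    }

    boolean shouldSkip(String blobId) {
        return skipped.contains(blobId);
    }

    public static void main(String[] args) {
        ThreeStrikesSketch s = new ThreeStrikesSketch();
        for (int i = 0; i < 3; i++) {
            s.recordFailure("bad.pdf");
        }
        System.out.println("skip bad.pdf: " + s.shouldSkip("bad.pdf"));
    }
}
```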

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>  Labels: resilience
> Fix For: 1.8
>
>


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-05-04 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996500#comment-15996500
 ] 

Thomas Mueller commented on OAK-5519:
-

Do we have a test case (for example a PDF file that runs out of memory no 
matter how much heap is available)?

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>  Labels: resilience
> Fix For: 1.8
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-04-20 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976491#comment-15976491
 ] 

Chetan Mehrotra commented on OAK-5519:
--

*Problematic Binary Handling* 

h3. A - Out of process 

Best solution for this case is making use of out of process text extraction 
i.e. TIKA-416 which was used in JR2 JCR-2864. My hunch is that this might not 
work in OSGi deployment as this implementation relies on [classloader which 
copies the class content to child process 
classloader|https://jukkaz.wordpress.com/2010/05/27/forking-a-jvm/].

Also be aware of TIKA-591 here

So something to try!

h3. B - Memory of bad file

Until #A can be implemented, we can add some support where we "memorize" the 
last file being processed. The main problem in handling such files is that, due 
to a bug in the parser, we may end up in an infinite loop or out of memory. In 
both cases the current context is lost, and if indexing starts again it would 
hit the same file again, as it would not remember that it was a bad file.

One way would be to record the file for which text is to be extracted in a file 
on the filesystem and, in case of an unclean end, use this file to find out the 
last file that was being processed and then exclude it.
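
A minimal sketch of such a marker file, assuming one binary is processed at a time. The class, the file name, and the helper that creates a temporary location are all hypothetical; nothing here is taken from Oak.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

class InProgressMarkerSketch {

    private final Path marker;

    InProgressMarkerSketch(Path marker) {
        this.marker = marker;
    }

    /** Records the binary before it is handed to the parser. */
    void begin(String blobId) {
        try {
            Files.write(marker, blobId.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Removes the marker once extraction has ended, successfully or not. */
    void end() {
        try {
            Files.deleteIfExists(marker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** On startup: a leftover marker names the binary that was in flight at an unclean end. */
    Optional<String> leftover() {
        try {
            if (!Files.exists(marker)) {
                return Optional.empty();
            }
            return Optional.of(new String(Files.readAllBytes(marker), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for trying the sketch out: a marker path in a fresh temporary directory. */
    static Path tempMarker() {
        try {
            return Files.createTempDirectory("marker-sketch").resolve("inProgress.txt");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On restart, the indexer would call {{leftover()}} first and add any returned identifier to its exclusion list before resuming.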

> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: indexing
>Reporter: Alexander Klimetschek
>  Labels: resilience
>


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-04-20 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976472#comment-15976472
 ] 

Chetan Mehrotra commented on OAK-5519:
--

bq.  It probably makes sense to deal with OOME as well (at least catch it and 
log the stack trace).

Would it be ok to log and still continue with indexing?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-04-20 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976382#comment-15976382
 ] 

Thomas Mueller commented on OAK-5519:
-

I recently saw an OutOfMemoryError during the index update; I'm not sure if 
that's caused by a problematic binary, a bug in the PDF text extraction tool, 
or something else. It probably makes sense to deal with OOME as well (at least 
catch it and log the stack trace).
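
As a rough sketch of "catch it and log the stack trace", assuming extraction 
can be wrapped in a {{Callable}}: the class {{GuardedExtraction}} below is 
hypothetical, not existing Oak code. Whether it is safe to carry on after an 
OOME is a separate question; the JVM may still be short on memory after the 
catch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper, not part of Oak. */
final class GuardedExtraction {
    private static final Logger LOG =
            Logger.getLogger(GuardedExtraction.class.getName());

    // Ids of binaries whose extraction failed, for a later retry.
    private final List<String> skipped = new ArrayList<>();

    /**
     * Runs the extractor; on any exception or OutOfMemoryError, logs the
     * stack trace, remembers the id and returns an empty string.
     */
    String extractOrSkip(String blobId, Callable<String> extractor) {
        try {
            return extractor.call();
        } catch (OutOfMemoryError | Exception e) {
            LOG.log(Level.WARNING,
                    "Text extraction failed for " + blobId + ", skipping", e);
            skipped.add(blobId);
            return "";
        }
    }

    List<String> skipped() {
        return skipped;
    }
}
```

Note that this does not help with a parser that loops forever without 
throwing.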

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing

2017-01-26 Thread Alexander Klimetschek (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840316#comment-15840316
 ] 

Alexander Klimetschek commented on OAK-5519:


Related issues:
* OAK-4939 addresses this in 1.5 and 1.6, but considers the entire index 
"corrupted" and isolates it; if this is an important full-text index, it would 
still impact users, as they won't find the other content (which itself is fine)
* OAK-3813, which I reported earlier, is about the datastore failing to resolve 
blobs (in this case S3, where you might have more failure scenarios)


> Skip problematic binaries instead of blocking indexing
> --
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: query
>Reporter: Alexander Klimetschek
>
> If a text extraction is broken (weird PDF) or a blob cannot be found in the 
> datastore or any other error upon indexing one item from the repository that 
> is outside the scope of the indexer, it currently halts the complete indexing 
> (lane). Thus one broken item (that maybe isn't important to the users at all) 
> can block the indexing of other, new content (that might be important to 
> users), and it always requires manual intervention to fix (which is also not 
> easy and requires oak experts).
> Instead, the item could be remembered in a known issue list, proper warnings 
> given, and indexing continue. Maintenance operations should be available to 
> come back to reindex these once the issue is fixed, or the indexer could 
> automatically retry after some time.
> I think the line should probably be drawn for binary properties. Not sure if 
> other JCR property types could trigger a similar issue, and if a failure in 
> them might actually warrant a halt, as it could lead to an "incorrect" index, 
> if these properties are important. But maybe the line is simply a try & catch 
> around "full text extraction".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)