[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247218#comment-16247218 ]

Thomas Mueller commented on OAK-5519:
-------------------------------------

[~jsedding] This only works if text extraction is reading, but in my case it's in an endless loop that doesn't read. Even Thread.interrupt() won't work in that case. However, you can try Thread.stop() if you are adventurous.

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>            Assignee: Thomas Mueller
>              Labels: resilience
>             Fix For: 1.8, 1.7.12
>
>
> If a text extraction is blocked (weird PDF) or a blob cannot be found in the
> datastore or any other error upon indexing one item from the repository that
> is outside the scope of the indexer, it currently halts the indexing (lane).
> Thus one item (that maybe isn't important to the users at all) can block the
> indexing of other, new content (that might be important to users), and it
> always requires manual intervention (which is also not easy and requires oak
> experts).
> Instead, the item could be remembered in a known issue list, proper warnings
> given, and indexing continue. Maintenance operations should be available to
> come back to reindex these, or the indexer could automatically retry after
> some time. This would allow normal user activity to go on without manual
> intervention, and solving the problem (if it's isolated to some binaries) can
> be deferred.
> I think the line should probably be drawn for binary properties. Not sure if
> other JCR property types could trigger a similar issue, and if a failure in
> them might actually warrant a halt, as it could lead to an "incorrect" index,
> if these properties are important. But maybe the line is simply a try & catch
> around "full text extraction".

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
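
The point about Thread.interrupt() in the comment above can be illustrated with a small sketch. It is not Oak code: the class and method names are made up for illustration, and the `stop` flag exists only so that the demo itself can terminate. A loop that neither blocks nor checks the interrupt flag simply keeps running after `interrupt()`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class, not part of Oak.
final class InterruptDemo {

    /**
     * Starts a busy loop that never checks the interrupt flag, interrupts it,
     * and reports whether the thread was still alive afterwards.
     */
    static boolean survivesInterrupt() {
        // Escape hatch so the demo can end; a real runaway parser has none.
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            long n = 0;
            while (!stop.get()) {
                n++; // pure CPU work: no I/O, no sleep, no interrupt check
            }
        });
        worker.setDaemon(true);
        worker.start();
        boolean alive;
        try {
            worker.interrupt();   // only sets the interrupt flag
            worker.join(200);     // give the thread time to react (it will not)
            alive = worker.isAlive();
            stop.set(true);       // now really end the loop
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return alive;
    }
}
```

Interruption is cooperative: it only takes effect when the target thread blocks in an interruptible call or polls `Thread.interrupted()`, which a parser stuck in a tight loop does not do.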
[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247189#comment-16247189 ]

Julian Sedding commented on OAK-5519:
-------------------------------------

[~tmueller] could the processing thread be terminated by closing the stream after the timeout? I suppose that should trip up the parser and cause an {{IOException}} on the next read.

Granted, I don't understand the full background of this issue. Maybe the endless loop scenarios don't read from the stream any more.

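
The stream-closing idea from the comment above could look roughly like the following sketch. All names here (`AbortableInputStream`, `AbortDemo`) are hypothetical and the endless input stream is a stand-in for a parser that keeps reading. As the comment itself notes, this only helps while the parser still calls `read()`.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: every read fails once abort() has been called.
final class AbortableInputStream extends FilterInputStream {
    private volatile boolean aborted;

    AbortableInputStream(InputStream in) {
        super(in);
    }

    void abort() {
        aborted = true;
    }

    private void check() throws IOException {
        if (aborted) {
            throw new IOException("stream aborted after timeout");
        }
    }

    @Override
    public int read() throws IOException {
        check();
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        check();
        return super.read(b, off, len);
    }
}

final class AbortDemo {
    /** Returns true if a reader looping over an endless stream stops once the stream is aborted. */
    static boolean readerStopsAfterAbort() {
        InputStream endless = new InputStream() {
            @Override
            public int read() {
                return 'x'; // never reaches end of stream
            }
        };
        AbortableInputStream in = new AbortableInputStream(endless);
        boolean[] sawIOException = {false};
        Thread parser = new Thread(() -> {
            try {
                while (in.read() != -1) {
                    // pretend to parse
                }
            } catch (IOException e) {
                sawIOException[0] = true;
            }
        });
        parser.setDaemon(true);
        parser.start();
        try {
            Thread.sleep(50);  // stand-in for the extraction timeout
            in.abort();
            parser.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !parser.isAlive() && sawIOException[0];
    }
}
```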
[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246063#comment-16246063 ]

Thomas Mueller commented on OAK-5519:
-------------------------------------

http://svn.apache.org/r1814745

[~chetanm] I have incorporated your requests. Features:

* No OSGi / JMX configuration right now, but "emergency" configuration via system properties (for example, ability to disable this feature, set timeout, ...).
* Timeout is 60 seconds.
* Timed out extraction is now stored to a file in the repository / index directory, in a properties file named "textExtractionTimeout.properties". Example content below. This file is read on startup.
{noformat}
#Text extraction timed out for the following binaries, and will not be retried
#Thu Nov 09 12:33:52 CET 2017
405dfb76526462a6268f1aacb09359179216df423c474b3a1f578b9c567faa35\#190148=TextExtractionError
d19a28de09b655dbe099ee9e72e5bc782088994cca054062213d80b22f2ac67f\#175=TextExtractionError
251c6082691578dc1aff306a59984e1b80a79befd8465e158335c5cbfe8bb596\#399142=TextExtractionError
{noformat}
* Failed extraction is cached.
* Number of extractions that timed out can be read via JMX (TextExtractionStatsMBean.getTimeoutCount). Each of those threads can consume 100% CPU (unless they stop at some point).
* It is using its own executor service with daemon threads. This is shut down when stopping the service, and restarted when needed. Just one thread usually, up to 10 (configurable), so worst case up to 900% CPU usage if 9 extractions time out.
* Thread name is "oak binary text extractor" plus the name of the extracted blob (similar to what it was before).
* Only binaries larger than 16 KB are extracted in a separate thread.
* A warning is logged if extraction times out.
* No change for OutOfMemory and so on (Throwable was already caught before this patch). So this patch only affects timeouts.

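
The example content of "textExtractionTimeout.properties" shown in the comment above is in standard Java properties syntax, so it can presumably be inspected with java.util.Properties. The sketch below is an assumption about how one might read such a file by hand, not the code from the commit; `TimeoutListReader` is a hypothetical name. Note that the `\#` escape in the keys is decoded to a plain `#` on load.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Oak.
final class TimeoutListReader {

    /** Returns the keys whose value is the "TextExtractionError" marker. */
    static Set<String> timedOutBlobIds(String fileContent) {
        Properties props = new Properties();
        try {
            // Lines starting with '#' are comments; "\#" inside a key is an escaped '#'.
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Set<String> ids = new TreeSet<>();
        for (String key : props.stringPropertyNames()) {
            if ("TextExtractionError".equals(props.getProperty(key))) {
                ids.add(key);
            }
        }
        return ids;
    }
}
```

For a real file, the same method could be fed the file content read from the index directory; the location of that directory depends on the deployment.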
[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244308#comment-16244308 ]

Chetan Mehrotra commented on OAK-5519:
--------------------------------------

bq. However, after a restart, Oak will not try to extract the same binary again, as indexing continued. Except if you upload the same binary to somewhere else, but I guess that's rare.

If that binary is getting indexed due to aggregation, then the same binary can be processed again if any other aggregated property gets modified. For example, in an asset-like structure, even if the original binary is not touched but some metadata is updated, that would trigger reindexing of the same asset subtree, again triggering text extraction.

bq. Well, as you wrote, using the following query I can get the list of binaries where extraction failed:

With one caveat: a Lucene document may contain text extracted from multiple binaries in case of aggregation (not that big a concern in general, as the others are mostly derived binaries). So this query may flag all binaries under a given subtree as blacklisted. But to start with, this query is useful for cases where text extraction did not end up in some infinite loop.

[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244076#comment-16244076 ]

Thomas Mueller commented on OAK-5519:
-------------------------------------

> Going forward we can probably store some hidden property to mark such binaries to avoid hitting them again (as cache is ephemeral)

That's what I thought as well, but actually, I think this is not needed. When adding a bad PDF, text extraction will run, then time out, and then the text "TextExtractionError" is stored in the fulltext index. Indexing continues. The thread will continue to consume 100% CPU until the process is killed or the thread is stopped. However, after a restart, Oak will not try to extract the same binary again, as indexing continued. Except if you upload the same binary to somewhere else, but I guess that's rare.

> We can possibly store some more data/marker in special field which can then later be queried to find out all such files which have not been indexed

Well, as you wrote, using the following query I can get the list of binaries where extraction failed:

{noformat}
/jcr:root//*[jcr:contains(., 'textextractionerror')]
{noformat}

Of course this includes binaries that contain this exact term, but I don't think that's a big problem.

[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243754#comment-16243754 ]

Chetan Mehrotra commented on OAK-5519:
--------------------------------------

bq. the text extraction cache only puts results in the cache if extraction was successful. I wonder why that is, it seems failure should also be cached.

+1. Note that currently, if text extraction for a file fails, we store a sentinel value "TextExtractionError" to indicate that there was an error processing it.

Thinking out loud: going forward we can probably store some hidden property to mark such binaries to avoid hitting them again (as the cache is ephemeral). However, this would be tricky, as IndexEditors currently do not have access to the NodeBuilder for that node. Maybe we can store it in the index data in some form (flat file?).

[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243702#comment-16243702 ]

Thomas Mueller commented on OAK-5519:
-------------------------------------

I found out why there are two threads consuming 100% each, and not just one: the text extraction cache only puts results in the cache if extraction was successful. I wonder why that is, it seems failure should also be cached. What do you think, [~chetanm], [~catholicon]?

{noformat}
public void put(@Nonnull Blob blob, @Nonnull ExtractedText extractedText) {
    String id = blob.getContentIdentity();
    if (extractedText.getExtractionResult() == ExtractedText.ExtractionResult.SUCCESS && ...) {
        cache.put(id, extractedText.getExtractedText().toString());
    }
}
{noformat}

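
The suggestion in the comment above, that failures should be cached as well, might be sketched as follows. This is a simplified, hypothetical stand-in and not Oak's ExtractedTextCache: the `Result` enum, the class name and the map-based cache are assumptions. Only the sentinel string "TextExtractionError" is taken from the discussion in this thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified cache; Oak's real types are not used here.
final class ExtractionCacheSketch {
    enum Result { SUCCESS, ERROR }

    static final String ERROR_SENTINEL = "TextExtractionError";

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Stores the text on success and the sentinel on failure, so a bad blob is not extracted twice. */
    void put(String blobId, Result result, String text) {
        if (blobId == null) {
            return; // nothing to key the entry on
        }
        if (result == Result.SUCCESS) {
            cache.put(blobId, text == null ? "" : text);
        } else {
            cache.put(blobId, ERROR_SENTINEL);
        }
    }

    /** Returns the cached text, the sentinel for a known failure, or null if unknown. */
    String get(String blobId) {
        return blobId == null ? null : cache.get(blobId);
    }
}
```

With such a cache, a second index that sees the same blob id would get the sentinel back instead of starting another extraction thread.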
[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243573#comment-16243573 ]

Thomas Mueller commented on OAK-5519:
-------------------------------------

My current approach is: extract larger binaries using a separate thread (ExecutorService, in ExtractedTextCache). Small binaries (less than 16 KB) are still extracted in the regular thread. If extraction takes longer than the timeout (1 minute right now), then ignore this binary and continue.

Current behavior:

* When trying to extract a binary that takes very long (or extraction has an endless loop), then the thread continues running, but extraction isn't blocked.
* The extraction thread has a "nice" thread name (includes the path of the node, binary, ...).
* The process can be stopped normally as the extraction thread is a daemon thread.
* When restarting the process, extraction of that binary is _not_ retried.

Open points:

* Should log a warning / error that text extraction failed for this binary.
* Add JMX support to detect runaway text extraction threads (in order to restart the process or manually stop those threads).
* Default values should be configurable.
* On endless loop in extraction, currently there are two threads consuming 100% each, instead of just one. Need to investigate (looks like there are two caches, which sounds wrong).

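
The approach described in the comment above (small binaries inline, larger ones on a separate daemon thread with a timeout) can be sketched as below. This is an illustration under assumptions, not the actual patch: the class name, constructor parameters and thread name are hypothetical, and the extraction itself is passed in as a plain `Callable`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not Oak's ExtractedTextCache.
final class TimedExtractor {
    static final String ERROR_SENTINEL = "TextExtractionError";

    private final long inlineThresholdBytes;
    private final long timeoutMillis;
    private final ExecutorService pool;

    TimedExtractor(long inlineThresholdBytes, long timeoutMillis, int maxThreads) {
        this.inlineThresholdBytes = inlineThresholdBytes;
        this.timeoutMillis = timeoutMillis;
        // Daemon threads, so a runaway extraction cannot keep the JVM from exiting.
        this.pool = Executors.newFixedThreadPool(maxThreads, r -> {
            Thread t = new Thread(r, "binary text extractor (sketch)");
            t.setDaemon(true);
            return t;
        });
    }

    /** Runs the extraction inline for small blobs, otherwise on the pool with a timeout. */
    String extract(long blobLength, Callable<String> extraction) {
        try {
            if (blobLength < inlineThresholdBytes) {
                return extraction.call();
            }
            Future<String> future = pool.submit(extraction);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupts the worker; a loop that ignores interrupts keeps running.
                future.cancel(true);
                return ERROR_SENTINEL;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return ERROR_SENTINEL;
        } catch (Exception e) {
            // Extraction failed for some other reason; the caller continues.
            return ERROR_SENTINEL;
        }
    }
}
```

A caller would create one instance, for example `new TimedExtractor(16 * 1024, 60_000, 10)` to mirror the numbers mentioned in this thread, and treat the sentinel as "skip this binary and continue".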
[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104727#comment-16104727 ]

Chetan Mehrotra commented on OAK-5519:
--------------------------------------

bq. it does nothing except throw an exception / error / out of memory error every time

Currently the [BinaryTextExtractor|https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/binary/BinaryTextExtractor.java#L161] is catching Throwable while calling Tika (thinking about it, we should handle it better and distinguish serious errors).

The case which is not handled is where the parser enters some infinite loop while processing, and for which the process needs to be killed. Maybe we can use an executor to handle text extraction, and on the indexer thread submit a job to it with some timeout. In case of any such error the executor pool thread would get blocked, but the indexer thread can continue and the system admin can look into restarting the process.

This would allow us to detect such problematic binaries more reliably, and we only need to remember them in case of a timeout or some exception in processing.

Thoughts?

[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101666#comment-16101666 ]

Thomas Mueller commented on OAK-5519:
-------------------------------------

[~catholicon] and [~chetanm] I think we should try the "Memory of bad file" solution, if that's simple. I assume we could write a test case first that uses a "custom" Tika config as documented in http://jackrabbit.apache.org/oak/docs/query/lucene.html#Tika_Config, custom in that it does nothing except throw an exception / error / out of memory error every time. Then check whether this runs into an endless loop. Then remember the file if it fails *three times* in a row. I think it would be better to wait three times, because the first failure might be due to a non-repeatable problem (out of memory caused by another thread).

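
The "three times in a row" rule proposed in the comment above could be tracked with a small counter per binary. The sketch below is hypothetical (class and method names are invented, and the counts live in memory only); it merely shows the bookkeeping, with a success resetting the streak.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical bookkeeping for consecutive extraction failures, not Oak code.
final class FailureMemory {
    private final int maxConsecutiveFailures;
    private final Map<String, Integer> failures = new ConcurrentHashMap<>();

    FailureMemory(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Counts one more failure in a row for this blob. */
    void recordFailure(String blobId) {
        failures.merge(blobId, 1, Integer::sum);
    }

    /** A success ends the streak, so an earlier failure counts as non-repeatable. */
    void recordSuccess(String blobId) {
        failures.remove(blobId);
    }

    /** True once the blob has failed the configured number of times in a row. */
    boolean shouldSkip(String blobId) {
        return failures.getOrDefault(blobId, 0) >= maxConsecutiveFailures;
    }
}
```

To survive a restart, the counts would have to be persisted somewhere, which this sketch leaves out.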
[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996500#comment-15996500 ]

Thomas Mueller commented on OAK-5519:
-------------------------------------

Do we have a test case (for example a PDF file that runs out of memory no matter how much heap is available)?

[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976491#comment-15976491 ]

Chetan Mehrotra commented on OAK-5519:
--------------------------------------

*Problematic Binary Handling*

h3. A - Out of process

The best solution for this case is to use out-of-process text extraction, i.e. TIKA-416, which was used in JR2 (JCR-2864). My hunch is that this might not work in an OSGi deployment, as this implementation relies on a [classloader which copies the class content to child process classloader|https://jukkaz.wordpress.com/2010/05/27/forking-a-jvm/]. Also be aware of TIKA-591 here. So something to try!

h3. B - Memory of bad file

Until #A can be implemented, we can implement some support where we "memorize" the last file being processed. The main problem in handling such files is that, due to a bug in the parser, we may end up in an infinite loop or run out of memory. In both cases the current context is lost, and if indexing starts again it would hit the same file again, as it would not remember that it was a bad file.

One way would be to record the file for which text is to be extracted in a file on the filesystem, and in case of an unclean end use this file to find out the last file which was being processed and then exclude that.

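
Option B from the comment above, recording the file currently being extracted so that it can be found after an unclean end, might be sketched like this. Everything here is hypothetical: the class name, the single-marker-file layout and the temp-directory helper are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical sketch of an "in progress" marker file, not Oak code.
final class InProgressMarker {
    private final Path marker;

    InProgressMarker(Path marker) {
        this.marker = marker;
    }

    /** Convenience for experiments: puts the marker in a fresh temp directory. */
    static InProgressMarker inTempDir() {
        try {
            Path dir = Files.createTempDirectory("extraction-sketch");
            return new InProgressMarker(dir.resolve("in-progress.txt"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Call right before extraction of the given blob starts. */
    void begin(String blobId) {
        try {
            Files.write(marker, blobId.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Call after extraction returned, whether it succeeded or threw a caught error. */
    void end() {
        try {
            Files.deleteIfExists(marker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** On startup: a leftover marker names the blob that was in progress when the process died. */
    Optional<String> suspectFromLastRun() {
        try {
            if (!Files.exists(marker)) {
                return Optional.empty();
            }
            return Optional.of(new String(Files.readAllBytes(marker), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On startup the indexer would call `suspectFromLastRun()` and, if a blob id is returned, add it to whatever exclusion list is in use before deleting the marker.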
[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976472#comment-15976472 ]

Chetan Mehrotra commented on OAK-5519:
--------------------------------------

bq. It probably makes sense to deal with OOME as well (at least catch it and log the stack trace).

Would it be ok to log and still continue with indexing?
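For illustration, "catch it, log the stack trace, and continue" might take the shape below. This is a sketch under assumptions, not the Oak indexer: `GuardedExtraction`, `extractOrEmpty`, and `extractAll` are hypothetical names, and the extractor is modelled as a plain function from path to text. Whether continuing after an OutOfMemoryError is actually safe is the open question in this thread; the JVM gives no guarantee that the heap is usable again after the error is caught.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper (not part of Oak) around a text extractor. */
class GuardedExtraction {
    private static final Logger LOG = Logger.getLogger(GuardedExtraction.class.getName());

    /** Returns the extracted text, or an empty string if extraction fails. */
    static String extractOrEmpty(String path, Function<String, String> extractor) {
        try {
            return extractor.apply(path);
        } catch (OutOfMemoryError | RuntimeException e) {
            // Log the full stack trace, then let the caller move on to the next item.
            LOG.log(Level.WARNING, "Text extraction failed for " + path + ", skipping", e);
            return "";
        }
    }

    /** Extracts every path in order; a failing item does not stop the loop. */
    static List<String> extractAll(List<String> paths, Function<String, String> extractor) {
        List<String> texts = new ArrayList<>();
        for (String path : paths) {
            texts.add(extractOrEmpty(path, extractor));
        }
        return texts;
    }
}
```

Only `OutOfMemoryError` and runtime exceptions are caught here, so other `Error` subclasses still propagate and halt the loop.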
[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976382#comment-15976382 ]

Thomas Mueller commented on OAK-5519:
-------------------------------------

I recently saw an OutOfMemoryError during the index update; I'm not sure whether it was caused by a problematic binary, a bug in the PDF text extraction tool, or something else. It probably makes sense to deal with OOME as well (at least catch it and log the stack trace).
[jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840316#comment-15840316 ]

Alexander Klimetschek commented on OAK-5519:
--------------------------------------------

Related issues:
* OAK-4939 addresses this in 1.5 and 1.6, but considers the entire index "corrupted" and isolates it; if this is an important full text index, then it would still impact users as they won't find the other content (that is fine)
* OAK-3813, which I reported earlier, is about the datastore failing to resolve blobs (in this case S3, where you might have more failure scenarios)

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Alexander Klimetschek
>
> If a text extraction is broken (weird PDF) or a blob cannot be found in the
> datastore or any other error upon indexing one item from the repository that
> is outside the scope of the indexer, it currently halts the complete indexing
> (lane). Thus one broken item (that maybe isn't important to the users at all)
> can block the indexing of other, new content (that might be important to
> users), and it always requires manual intervention to fix (which is also not
> easy and requires oak experts).
> Instead, the item could be remembered in a known issue list, proper warnings
> given, and indexing continue. Maintenance operations should be available to
> come back to reindex these once the issue is fixed, or the indexer could
> automatically retry after some time.
> I think the line should probably be drawn for binary properties. Not sure if
> other JCR property types could trigger a similar issue, and if a failure in
> them might actually warrant a halt, as it could lead to an "incorrect" index,
> if these properties are important. But maybe the line is simply a try & catch
> around "full text extraction".
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
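The "known issue list" with automatic retry proposed in the issue description could be sketched as follows. This is an in-memory illustration only, under stated assumptions: `KnownIssueList` and its methods are hypothetical names, items are identified by repository path, and a real implementation would have to persist the list and decide what "some time" means.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical known-issue list (not part of Oak): failed items and when they failed. */
class KnownIssueList {
    private final Duration retryAfter;
    private final Map<String, Instant> failedAt = new LinkedHashMap<>();

    KnownIssueList(Duration retryAfter) {
        this.retryAfter = retryAfter;
    }

    /** Remembers an item whose indexing failed, so indexing can continue without it. */
    void recordFailure(String path, Instant now) {
        failedAt.put(path, now);
    }

    /** Removes an item once a later attempt (or a manual reindex) succeeded. */
    void recordSuccess(String path) {
        failedAt.remove(path);
    }

    /** True while the item is on the list and its retry delay has not elapsed. */
    boolean shouldSkip(String path, Instant now) {
        Instant failed = failedAt.get(path);
        return failed != null && now.isBefore(failed.plus(retryAfter));
    }

    /** Items whose retry delay has elapsed, in the order they were recorded. */
    List<String> dueForRetry(Instant now) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : failedAt.entrySet()) {
            if (!now.isBefore(entry.getValue().plus(retryAfter))) {
                due.add(entry.getKey());
            }
        }
        return due;
    }
}
```

Passing the current time in as a parameter keeps the class deterministic. The indexer would call `recordFailure` from its error handler and consult `shouldSkip` before extracting, while a maintenance operation could iterate over `dueForRetry`.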