[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481555#comment-14481555 ]

Konstantin Gribov commented on TIKA-1330:
-----------------------------------------

[~talli...@mitre.org], you have mixed CRLF/LF in tika-batch, in pom.xml at least. Is it OK if I fix it or you're working on this module now?

Add robust tika-batch code
--------------------------

Key: TIKA-1330
URL: https://issues.apache.org/jira/browse/TIKA-1330
Project: Tika
Issue Type: Sub-task
Components: cli, general, server
Reporter: Tim Allison
Assignee: Tim Allison
Attachments: TIKA-1330v1-patch.zip

In my current design plan, I see creating a separate component tika-batch that includes a small bit of configurable code to run Tika against a large batch of documents. This code should be robust against OOM and hangs, and it should have fairly robust logging.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481614#comment-14481614 ]

Hudson commented on TIKA-1330:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #607 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/607/])
tika-batch cosmetics TIKA-1330 (grossws: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1671627)
* /tika/trunk/tika-batch/pom.xml
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481902#comment-14481902 ]

Tim Allison commented on TIKA-1330:
-----------------------------------

Thank you, [~grossws]!
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481917#comment-14481917 ]

Konstantin Gribov commented on TIKA-1330:
-----------------------------------------

That was just a test to check that INFRA-9355 is fixed.

If you aren't working on tika-batch now, I can do simple stylecheck (for indent, CRLF/LF consistency). But if you are working on it now, I'll just put it into todo list to avoid heavy merges in work-in-progress code.
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482242#comment-14482242 ]

Tim Allison commented on TIKA-1330:
-----------------------------------

Thank you for the offer! I'm going to turn to some other tasks for now so please fix whatever you'd like. I'm happy to make the fixes, too.

I'm sorry that CRs snuck into the code. I need to check the settings in Intellij, again.
Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code
All tests are passing. Only issue I see is excessive logging. The Hudson failure does just look like a hiccup.

Tyler

On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org wrote:

This looks like a Hudson hiccup. Tyler is seeing excessive logging:

Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
INFO - about to start driver
INFO - about to start driver

Anyone else having problems building from a fresh trunk?
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392027#comment-14392027 ]

Hudson commented on TIKA-1330:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #599 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/599/])
TIKA-1330 fix logging in TikaCLI to avoid adding multiple appenders (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670804)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j.properties
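The commit message above is about appenders accumulating on a logger. A minimal sketch of that failure mode follows; it is not the TikaCLI change itself. It assumes java.util.logging as a stand-in for log4j so that it runs without extra dependencies, and every class and method name in it is hypothetical. Each call to the naive setup attaches one more handler, so a single log call is recorded once per handler; the guarded setup attaches at most one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Sketch only: java.util.logging stands in for log4j; all names are hypothetical. */
final class DuplicateAppenderSketch {

    /** Handler that records messages in memory so the effect can be counted. */
    static final class ListHandler extends Handler {
        private final List<String> lines;

        ListHandler(List<String> lines) {
            this.lines = lines;
        }

        @Override
        public void publish(LogRecord record) {
            lines.add(record.getMessage());
        }

        @Override
        public void flush() {
            // nothing buffered
        }

        @Override
        public void close() {
            // nothing to release
        }
    }

    /** Naive setup: attaches a new handler every time it is called. */
    static void configureNaively(Logger logger, List<String> sink) {
        logger.addHandler(new ListHandler(sink));
    }

    /** Guarded setup: attaches a handler only if none of this type is present yet. */
    static void configureOnce(Logger logger, List<String> sink) {
        for (Handler h : logger.getHandlers()) {
            if (h instanceof ListHandler) {
                return;
            }
        }
        logger.addHandler(new ListHandler(sink));
    }

    /** Runs the chosen setup several times, logs one message, and returns how often it was recorded. */
    static int linesAfterRepeatedSetup(boolean guarded, int calls) {
        Logger logger = Logger.getAnonymousLogger();
        logger.setUseParentHandlers(false); // keep the demo off the console
        logger.setLevel(Level.INFO);
        List<String> sink = new ArrayList<>();
        for (int i = 0; i < calls; i++) {
            if (guarded) {
                configureOnce(logger, sink);
            } else {
                configureNaively(logger, sink);
            }
        }
        logger.info("about to start driver");
        return sink.size();
    }

    public static void main(String[] args) {
        System.out.println("naive:   " + linesAfterRepeatedSetup(false, 3));
        System.out.println("guarded: " + linesAfterRepeatedSetup(true, 3));
    }
}
```

With three setup calls the naive variant records the message three times and the guarded variant once. The guard shown here is only one possible remedy; the actual change in r1670804 may take a different approach.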
RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code
The duplication/triplication of

INFO - about to start driver
INFO - about to start driver

was because main() adds a new appender with .configure(), so subsequent calls to main() in the tests were adding more appenders. I just fixed that in r1670804.

What I can't figure out is why you're seeing anything. I've redirected both stdout and stderr in setup() (annotated @Before) to ByteArrayOutputStreams. If setup() weren't being called, you'd get NPEs for each of the four tests, so setup() must be getting called.

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Wednesday, April 01, 2015 7:39 PM
To: dev@tika.apache.org
Subject: Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

All tests are passing. Only issue I see is excessive logging. The Hudson failure does just look like a hiccup.

Tyler
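The redirection described in this message, pointing stdout and stderr at ByteArrayOutputStreams before each test, could look roughly like the sketch below. It is not the code from TikaCLIBatchIntegrationTest: the class and method names are hypothetical, and the JUnit @Before/@After annotations are left out so the sketch runs without a test framework.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

/** Sketch only: hypothetical names; JUnit annotations omitted to keep it dependency-free. */
final class StreamCaptureSketch {
    private PrintStream originalOut;
    private PrintStream originalErr;
    private ByteArrayOutputStream outContent;
    private ByteArrayOutputStream errContent;

    /** Would be annotated @Before in a JUnit 4 test. */
    void setUp() {
        originalOut = System.out;
        originalErr = System.err;
        outContent = new ByteArrayOutputStream();
        errContent = new ByteArrayOutputStream();
        // autoflush so println output is visible in the buffers immediately
        System.setOut(new PrintStream(outContent, true));
        System.setErr(new PrintStream(errContent, true));
    }

    /** Would be annotated @After: always restore the real streams. */
    void tearDown() {
        System.setOut(originalOut);
        System.setErr(originalErr);
    }

    String capturedOut() {
        return outContent.toString();
    }

    String capturedErr() {
        return errContent.toString();
    }

    public static void main(String[] args) {
        StreamCaptureSketch capture = new StreamCaptureSketch();
        capture.setUp();
        System.out.println("INFO - about to start driver");
        capture.tearDown();
        System.out.println("captured: " + capture.capturedOut().trim());
    }
}
```

One general caveat, offered as background rather than as a diagnosis of the output seen in this thread: anything that obtained a reference to the original System.out before setUp() ran keeps writing to the original stream, so a redirect of this kind only affects code that looks up System.out afterwards.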
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391255#comment-14391255 ]

Hudson commented on TIKA-1330:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #595 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/595/])
TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670749)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j_batch_process.properties
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/log4j_batch_process_test.properties
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java
* /tika/trunk/tika-batch/src/test/resources/log4j.properties
* /tika/trunk/tika-batch/src/test/resources/log4j_process.properties
RE: [jira] [Commented] (TIKA-1330) Add robust tika-batch code
This looks like a Hudson hiccup. Tyler is seeing excessive logging:

Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
INFO - about to start driver
INFO - about to start driver

Anyone else having problems building from a fresh trunk?

-----Original Message-----
From: Hudson (JIRA) [mailto:j...@apache.org]
Sent: Wednesday, April 01, 2015 5:36 PM
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512 ]

Hudson commented on TIKA-1330:
------------------------------

ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
TIKA-1330 flush stacktrace writers (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch, take 2 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512 ]

Hudson commented on TIKA-1330:
------------------------------

ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
TIKA-1330 flush stacktrace writers (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch, take 2 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387394#comment-14387394 ]

Hudson commented on TIKA-1330:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #588 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/588/])
TIKA-1330, trivial fixes to avoid NPE with consumersManagerMaxMillis parameter (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670185)
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/BasicTikaFSConsumersBuilder.java
* /tika/trunk/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387864#comment-14387864 ]

Hudson commented on TIKA-1330:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #589 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/589/])
TIKA-1330: add integration tests to TikaCLITest (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670237)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/SimpleLogReporterBuilder.java
* /tika/trunk/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-MockConsumersBuilder.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-broken.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-test.xml
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380479#comment-14380479 ]

Hudson commented on TIKA-1330:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #571 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/571/])
TIKA-1330 clean up logging and some dependencies. Still some log4j dependencies for now (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1669187)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-batch/pom.xml
* /tika/trunk/tika-batch/src/main/examples
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawler.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/Interrupter.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/StatusReporter.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/strawman/StrawManTikaAppDriver.java
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376175#comment-14376175 ]

Hudson commented on TIKA-1330:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #566 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/566/])
initial commit of TIKA-1330 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1668673)
* /tika/trunk/CHANGES.txt
* /tika/trunk/pom.xml
* /tika/trunk/tika-app/pom.xml
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/resources/log4j.properties
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchCommandLineTest.java
* /tika/trunk/tika-batch
* /tika/trunk/tika-batch/pom.xml
* /tika/trunk/tika-batch/src
* /tika/trunk/tika-batch/src/main
* /tika/trunk/tika-batch/src/main/examples
* /tika/trunk/tika-batch/src/main/examples/batchExecutor.sh
* /tika/trunk/tika-batch/src/main/examples/log4j.xml
* /tika/trunk/tika-batch/src/main/examples/log4j_driver.xml
* /tika/trunk/tika-batch/src/main/java
* /tika/trunk/tika-batch/src/main/java/org
* /tika/trunk/tika-batch/src/main/java/org/apache
* /tika/trunk/tika-batch/src/main/java/org/apache/tika
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/AutoDetectParserFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchNoRestartError.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ConsumersManager.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileConsumerFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResource.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawler.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawlerFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileStarted.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/IFileProcessorFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/Interrupter.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/InterrupterFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/OutputStreamFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParserFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/PoisonFileResource.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/StatusReporter.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/StatusReporterFutureResult.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/AbstractConsumersBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/BatchProcessBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/CommandLineParserBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/DefaultContentHandlerFactoryBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/IContentHandlerFactoryBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ICrawlerBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/InterrupterBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMAndQueueBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ReporterBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/SimpleLogReporterBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/StatusReporterBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSConsumersManager.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSDirectoryCrawler.java
*
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348047#comment-14348047 ]

Tim Allison commented on TIKA-1330:
-----------------------------------

Posted patch to review board [31758|https://reviews.apache.org/r/31758/]

Some more work is needed, but this is ready for thumbs up/thumbs down and any and all review. Depending on feedback, I'd like to merge this into trunk over the next week or two.

Simplest way to run it is from tika-app:
java -jar tika-app---.jar input-dir output-dir
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205290#comment-14205290 ]

Tim Allison commented on TIKA-1330:
-----------------------------------

Added preliminary integration into tika-app on GitHub [fork|https://github.com/tballison/tika/tree/TIKA-1302] (branch TIKA-1302).

Minimal call:
java -jar tika-app.jar input-directory
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147922#comment-14147922 ]

Tim Allison commented on TIKA-1330:
-----------------------------------

[~tilman], I leave it as an exercise to implement a FileResourceConsumer that uses pure PDFBox. ;)

Seriously, though, I plan to add something like that in the tika examples module (at some point down the road), and all feedback is welcome.
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121454#comment-14121454 ]

Tim Allison commented on TIKA-1330:
-----------------------------------

Started documentation on the [wiki|https://wiki.apache.org/tika/TikaBatch]. Any and all feedback is welcomed.

Will post patch to rb (if possible) or to this issue some time next week.
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119745#comment-14119745 ]

Tim Allison commented on TIKA-1330:
-----------------------------------

Looks like ballpark estimate on time for processing on TIKA-1302 was about right. I just finished a complete run of govdocs1 (~1 million files) on an 8 cpu vm with 8 gb available, -Xmx4g. The run used 15 consumers and completed in about 4 hours. The driver restarted the process thirteen times (6 permanent hangs and 7 OOM).
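The restart behaviour reported above (a driver that starts the batch process again after a hang or an OOM) can be pictured with a small simulation. This is only a sketch of the general restart-loop idea under stated assumptions, not tika-batch's driver: every name below is hypothetical, the child process is replaced by a scripted stand-in, and the restart limit is an invented parameter. How the real driver detects a hang or an OOM is not described in this thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch only: every name here is hypothetical and not part of tika-batch. */
final class RestartingDriverSketch {

    /** How one run of the (simulated) child process ended. */
    enum Outcome { COMPLETED, HANG, OOM }

    /** One attempt at running the child process. */
    interface ChildRun {
        Outcome runOnce();
    }

    /** Counts of restarts by cause. */
    static final class Tally {
        int hangs;
        int ooms;

        int restarts() {
            return hangs + ooms;
        }
    }

    /** Re-runs the child until it completes, giving up once maxRestarts restarts have been used. */
    static Tally drive(ChildRun child, int maxRestarts) {
        Tally tally = new Tally();
        while (true) {
            Outcome outcome = child.runOnce();
            if (outcome == Outcome.COMPLETED) {
                return tally;
            }
            if (tally.restarts() >= maxRestarts) {
                throw new IllegalStateException("giving up after " + maxRestarts + " restarts");
            }
            if (outcome == Outcome.HANG) {
                tally.hangs++;
            } else {
                tally.ooms++;
            }
            // loop around: this is the restart
        }
    }

    /** Simulated child that fails a fixed number of times and then completes. */
    static ChildRun scripted(int hangs, int ooms) {
        final Deque<Outcome> script = new ArrayDeque<>();
        for (int i = 0; i < hangs; i++) {
            script.add(Outcome.HANG);
        }
        for (int i = 0; i < ooms; i++) {
            script.add(Outcome.OOM);
        }
        return () -> script.isEmpty() ? Outcome.COMPLETED : script.removeFirst();
    }

    public static void main(String[] args) {
        // Failure counts taken from the run reported above: 6 hangs and 7 OOMs.
        Tally tally = drive(scripted(6, 7), 20);
        System.out.println("restarts=" + tally.restarts()
                + " hangs=" + tally.hangs + " ooms=" + tally.ooms);
    }
}
```

Keeping the restart decision in a loop outside the worker is what lets a run continue after the worker becomes unusable; a cap on restarts, as assumed here, is one way to keep a persistently failing batch from looping forever.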