[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119745#comment-14119745 ]
Tim Allison commented on TIKA-1330:
-----------------------------------

Looks like the ballpark estimate of processing time on TIKA-1302 was about right. I just finished a complete run of govdocs1 (~1 million files) on an 8-CPU VM with 8 GB available, -Xmx4g. The run used 15 consumers and completed in about 4 hours. The driver restarted the process thirteen times (6 permanent hangs and 7 OOMs).

> Add robust tika-batch code
> --------------------------
>
>                 Key: TIKA-1330
>                 URL: https://issues.apache.org/jira/browse/TIKA-1330
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>
> In my current design plan, I see creating a separate component "tika-batch"
> that includes a small bit of configurable code to run Tika against a large
> batch of documents. This code should be robust against OOM and hangs, and it
> should have fairly robust logging.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
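
For readers unfamiliar with the restart behaviour mentioned in the comment (a driver that restarts the batch process after a hang or an OOM), here is a minimal sketch of the general pattern. It is not the tika-batch code. The function name run_with_restarts, its parameters, and the returned counters are hypothetical, and it makes two simplifying assumptions: a hang is detected with a plain wall-clock timeout, and any non-zero exit code is treated as a crash (an OOM would be one such case). It also restarts the child from scratch, whereas a real batch driver would need some way to avoid reprocessing files.

```python
import subprocess


def run_with_restarts(cmd, timeout_s, max_restarts):
    """Run cmd as a child process, restarting it after a hang or a crash.

    Hypothetical helper for illustration only.
    Returns counters for hangs, crashes, and whether any run exited cleanly.
    """
    stats = {"hangs": 0, "crashes": 0, "completed": False}
    # One initial attempt plus up to max_restarts restarts.
    for _ in range(max_restarts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Assumption: exceeding the timeout counts as a hang.
            # subprocess.run kills the child before raising this exception.
            stats["hangs"] += 1
            continue
        if result.returncode == 0:
            stats["completed"] = True
            break
        # Assumption: any non-zero exit (for example after an OOM) is a crash.
        stats["crashes"] += 1
    return stats
```

Calling run_with_restarts(cmd, timeout_s=600, max_restarts=20) with cmd set to the command line of the child process would return the counts once the child exits cleanly or the restart budget runs out. Keeping the counters separate mirrors the way the comment reports hangs and OOMs as distinct restart causes.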