[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219371#comment-14219371 ]
Tim Allison commented on TIKA-1302:
-----------------------------------

HPC is way beyond the current status of tika-batch, which is initially aimed at conventional/single-box computing. I heartily welcome tika-batch-hadoop and any other tika-batch-HPC packages!

If you do want to join in the effort on tika-batch, please do! I need plenty of help in code review, unit tests, usability and edge case (i.e., bug) discovery. I'd also love to halve the amount of code and keep the robustness, extensibility and logging.

You can grab my dev version of tika-batch from my github [fork|https://github.com/tballison/tika/tree/TIKA-1302]. See some background on the [wiki|https://wiki.apache.org/tika/TikaBatchOverview].

I finished an initial integration with tika-app, and you should be able to run tika-app with:
{noformat}
java -jar tika-app.jar <srcDirectory>
{noformat}
That will iterate through the srcDirectory and output files in a directory named "output".

There are lots of commandline arguments available. I'm going to update the usage [wiki|http://wiki.apache.org/tika/TikaBatchUsage] shortly, but the usual -? from the app will give you some of the options.

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)