The "TikaBatchUsage" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchUsage?action=diff&rev1=6&rev2=7

  = Usage =
  See TikaBatchOverview for a general design overview of tika-batch.

- The code is currently available [[https://github.com/tballison/tika/tree/TIKA-1330|here]]. The current goal is to add this into Tika 1.8.
+ tika-batch is now available in trunk and will be available in Tika 1.8.

  == TikaBatch FileSystem (FS) ==
  For expert users who don't want to use tika-app or who might want to do custom extensions, there are example driver files and logging config files available in [[https://github.com/tballison/tika/tree/TIKA-1302/tika-batch/src/main/examples|here]].

@@ -39, +39 @@

  java -jar tika-app.X.Y.jar -t -i <inputDirectory> -o <outputDirectory>

- *Use the RecursiveParserWrapper and store text for each document:
+ *Use the !RecursiveParserWrapper and store text for each document:
  java -jar tika-app.X.Y.jar -J -t -i <inputDirectory> -o <outputDirectory>

  *Specify jvm args to be used by the child process (prepend a "J" to the regular args):
@@ -52, +52 @@

  === Some notes ===
- *The watchdog process will restart the child process unless the child process exits with a "do not restart value"=666. If you want to kill all processing, make sure to kill the parent process and then the child process.
+ *The watchdog process will restart the child process unless the child process exits with a "do not restart value"=254. If you want to kill all processing, make sure to kill the parent process and then the child process.
- *Because of a feature in javax's xml parser and the way the parser is configured in PDFBox 1.8.8, it is common to see error messages (e.g. Fatal Error :21:9: Element type "pdfx:" must be followed by either attribute specifications, ">" or "/>"). That should go away when Tika migrates to PDFBox 2.0.

  == TikaBatch Server ==

@@ -64, +63 @@

  == TikaBatch Hadoop ==
  Module not yet implemented within Tika project...want to contribute?
+ See TikaInHadoop.
- Some external project links and blogs:
- *[[http://svn.apache.org/repos/asf/oodt/trunk/crawler|Apache OODT Crawler]]
- *[[https://github.com/DigitalPebble/behemoth|DigitalPebble]]
- *[[http://openpreservation.org/knowledge/blogs/2014/03/21/tika-ride-characterising-web-content-nanite/|Nanite]]
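
The restart rule in the "Some notes" hunk (restart the child unless it exits with the "do not restart value" of 254) can be illustrated with a small shell loop. This is only a sketch of the described behavior, not tika-batch's actual watchdog. The names `watchdog`, `fake_child`, and `COUNT_FILE` are hypothetical, and the simulated child stands in for the real batch process.

```shell
#!/bin/sh
# Sketch only: mimics "restart unless exit status is 254" as described in the
# notes. It is not the tika-batch implementation.
DO_NOT_RESTART=254

# watchdog CMD [ARGS...]
# Runs CMD repeatedly until it exits with DO_NOT_RESTART, then prints how
# many times CMD was started.
watchdog() {
    starts=0
    while :; do
        starts=$((starts + 1))
        if "$@"; then
            status=0
        else
            status=$?
        fi
        if [ "$status" -eq "$DO_NOT_RESTART" ]; then
            break
        fi
    done
    echo "$starts"
}

# Hypothetical child: fails with status 1 on its first two runs, then exits
# with 254 on the third. A counter file carries state between runs.
COUNT_FILE=$(mktemp)
echo 0 > "$COUNT_FILE"

fake_child() {
    n=$(cat "$COUNT_FILE")
    n=$((n + 1))
    echo "$n" > "$COUNT_FILE"
    [ "$n" -ge 3 ] && return 254
    return 1
}

result=$(watchdog fake_child)
echo "child was started $result times"
```

Because the loop restarts on every status other than 254, stopping the child alone is not enough. That is why the note says to kill the parent process first and then the child.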
