Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaBatchUsage" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchUsage?action=diff&rev1=7&rev2=8

  = Usage =
  See TikaBatchOverview for a general design overview of tika-batch.
- tika-batch is now available in trunk and will be available in Tika 1.8.
+ tika-batch was added to Tika 1.8 as its own package, and it was integrated into tika-app with 1.8 as well.
- == TikaBatch FileSystem (FS) ==
- For expert users who don't want to use tika-app or who might want to do custom extensions, there are example driver files and logging config files available in [[https://github.com/tballison/tika/tree/TIKA-1302/tika-batch/src/main/examples|here]].
  == TikaBatch via tika-app-X.Y.jar ==
- There is an initial integration with tika-app on a github [[https://github.com/tballison/tika/tree/TIKA-1302|fork]].
  You can see the commandline arguments via the regular "-?" or "--help" commands. There is a separate section at the end for tika-batch options.

@@ -25, +22 @@
  *Most basic (with output to a directory called "output"):
- java -jar tika-app.X.Y.jar <inputDirectory> <outputDirectory>
+ `java -jar tika-app.X.Y.jar <inputDirectory> <outputDirectory>`
  *Specify input and output directories:
- java -jar tika-app.X.Y.jar -i /mydata/src/dir -o /mydata/output/dir
+ `java -jar tika-app.X.Y.jar -i /mydata/src/dir -o /mydata/output/dir`
  *Set the number of file consumer threads:
- java -jar tika-app.X.Y.jar -numConsumers 10 -i <inputDirectory> -o <outputDirectory>
+ `java -jar tika-app.X.Y.jar -numConsumers 10 -i <inputDirectory> -o <outputDirectory>`
  *Output text instead of xml
- java -jar tika-app.X.Y.jar -t -i <inputDirectory> -o <outputDirectory>
+ `java -jar tika-app.X.Y.jar -t -i <inputDirectory> -o <outputDirectory>`
  *Use the !RecursiveParserWrapper and store text for each document:
- java -jar tika-app.X.Y.jar -J -t -i <inputDirectory> -o <outputDirectory>
+ `java -jar tika-app.X.Y.jar -J -t -i <inputDirectory> -o <outputDirectory>`
  *Specify jvm args to be used by the child process (prepend a "J" to the regular args):
- java -jar tika-app.X.Y.jar -JXmx2g -JDlog4j.configuration={{file:bin/log4j.xml}} -i <inputDirectory> -o <outputDirectory>
+ `java -jar tika-app.X.Y.jar -JXmx2g -JDlog4j.configuration=log4j.xml -i <inputDirectory> -o <outputDirectory>`
  *Commandline to generate output files for tika-eval...only process those files listed in pdfs_random_50000.csv:
- java -Dlog4j.configuration=file:bin/log4j_driver.xml -jar tika-app-X.Y.jar -JXmx6g -JDlog4j.configuration=file:bin/log4j.xml -bc tika-batch-config-basic-test.xml -numConsumers 10 -o <outputDirectory> -i <inputDirectory> -fileList pdfs_random_50000.csv
+ `java -Dlog4j.debug -Dlog4j.configuration=file:log4j_driver.xml -cp "bin/*" org.apache.tika.cli.TikaCLI -JXX:-OmitStackTraceInFastThrow -JXmx5g -JDlog4j.configuration=file:log4j.xml -bc tika-batch-config-basic-test.xml -i <input_directory> -o <output_directory> -fileList pdfs_random_50000.csv`
+ === Some notes ===
  *The watchdog process will restart the child process unless the child process exits with a "do not restart value"=254. If you want to kill all processing, make sure to kill the parent process and then the child process.
-
+ *Make sure to add -JXX:-OmitStackTraceInFastThrow to the child process's commandline arguments so that Java doesn't swallow your stack traces.
  == TikaBatch Server ==
  Module not yet implemented...want to contribute?
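
The "J" prefix and the watchdog restart rule described above can be illustrated with a small shell sketch. This is not tika-batch's own code. The function names `child_jvm_args` and `run_with_restart` are hypothetical, and two assumptions are made: a bare `-J` is not a JVM argument (the examples use it as a separate option), and every exit status other than 254 leads to a restart.

```shell
# Sketch only: hypothetical helpers, not part of tika-app or tika-batch.

DO_NOT_RESTART=254   # the "do not restart" exit value named in the notes

# Print the child-JVM form of each "-J<arg>" argument, one per line.
# Assumption: a bare "-J" is a separate option and is skipped here.
child_jvm_args() {
  for arg in "$@"; do
    case "$arg" in
      -J?*) printf '%s\n' "-${arg#-J}" ;;
    esac
  done
}

# Re-run a command until it exits with DO_NOT_RESTART.
# Assumption: any other exit status, including 0, triggers a restart.
run_with_restart() {
  while :; do
    status=0
    "$@" || status=$?
    if [ "$status" -eq "$DO_NOT_RESTART" ]; then
      return 0
    fi
    echo "child exited with $status; restarting" >&2
  done
}
```

For example, `child_jvm_args -JXmx2g -JXX:-OmitStackTraceInFastThrow` would print `-Xmx2g` and `-XX:-OmitStackTraceInFastThrow`. A loop like `run_with_restart` also shows why the notes say to kill the parent first: while the parent is alive, a killed child is simply started again.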
