Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaBatchUsage" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchUsage?action=diff&rev1=5&rev2=6

  = Usage =
  See TikaBatchOverview for a general design overview of tika-batch.

- This is all still very much in a dev state. The code is currently available [[https://github.com/tballison/tika/tree/TIKA-1302|here]]. The current goal is to get this into decent enough shape to make it into Tika 1.8.
+ The code is currently available [[https://github.com/tballison/tika/tree/TIKA-1330|here]]. The current goal is to add this into Tika 1.8.
- 
  == TikaBatch FileSystem (FS) ==
  For expert users who don't want to use tika-app or who might want to do custom extensions, there are example driver files and logging config files available in [[https://github.com/tballison/tika/tree/TIKA-1302/tika-batch/src/main/examples|here]].

@@ -15, +14 @@

  You can see the commandline arguments via the regular "-?" or "--help" commands. There is a separate section at the end for tika-batch options.

  In the current dev version, tika-app decides if it is in batch mode based on one of two signals:
- 1. The final argument in the commandline args is a directory
+ 
+ 1. There are only two arguments and the first one is an existing directory
+ 
- 2. -srcDir is specified in the commandline
+ 2. -inputDir or -i is specified in the commandline

  Once the app knows that it is in batch mode, it converts some of the traditional tika-app commandline arguments for use by org.apache.tika.batch.fs.FSBatchProcessCLI.
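The two batch-mode signals described in the new revision can be illustrated with a minimal sketch. It is written in Python for brevity and is not the tika-app implementation; the function name `is_batch_mode` is hypothetical, and the real Java code may check the arguments differently.

```python
import os


def is_batch_mode(args):
    """Hypothetical helper: mirror the two batch-mode signals from the wiki text."""
    # Signal 2: -inputDir or -i is specified anywhere on the commandline.
    if "-inputDir" in args or "-i" in args:
        return True
    # Signal 1: there are only two arguments and the first one
    # is an existing directory.
    return len(args) == 2 and os.path.isdir(args[0])
```

Under these assumptions, `is_batch_mode(["-i", "/mydata/src/dir", "-o", "/mydata/output/dir"])` is true regardless of whether the directory exists, while a two-argument call is treated as batch mode only if its first argument is an existing directory.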
@@ -24, +25 @@

  *Most basic (with output to a directory called "output"):

- java -jar tika-app.X.Y.jar <inputDirectory>
+ java -jar tika-app.X.Y.jar <inputDirectory> <outputDirectory>
+ 
+ *Specify input and output directories:
+ 
+ java -jar tika-app.X.Y.jar -i /mydata/src/dir -o /mydata/output/dir

  *Set the number of file consumer threads:

- java -jar tika-app.X.Y.jar -numConsumers 10 <inputDirectory>
+ java -jar tika-app.X.Y.jar -numConsumers 10 -i <inputDirectory> -o <outputDirectory>

- *Specify input and output directories:
+ *Output text instead of xml

- java -jar tika-app.X.Y.jar -srcDir /mydata/src/dir -targDir /mydata/output/dir
+ java -jar tika-app.X.Y.jar -t -i <inputDirectory> -o <outputDirectory>
+ 
+ *Use the RecursiveParserWrapper and store text for each document:
+ java -jar tika-app.X.Y.jar -J -t -i <inputDirectory> -o <outputDirectory>

  *Specify jvm args to be used by the child process (prepend a "J" to the regular args):

- java -jar tika-app.X.Y.jar -JXmx2g -JDlog4j.configuration={{file:bin/log4j.xml}} <inputDirectory>
+ java -jar tika-app.X.Y.jar -JXmx2g -JDlog4j.configuration={{file:bin/log4j.xml}} -i <inputDirectory> -o <outputDirectory>

  *Commandline to generate output files for tika-eval...only process those files listed in pdfs_random_50000.csv:

- java -Dlog4j.configuration=file:bin/log4j_driver.xml -jar tika-app-X.Y.jar -JXmx6g -JDlog4j.configuration=file:bin/log4j.xml -bc tika-batch-config-basic-test.xml -numConsumers 10 -targDir <targDir> -srcDir <srcDir> -fileList pdfs_random_50000.csv
+ java -Dlog4j.configuration=file:bin/log4j_driver.xml -jar tika-app-X.Y.jar -JXmx6g -JDlog4j.configuration=file:bin/log4j.xml -bc tika-batch-config-basic-test.xml -numConsumers 10 -o <outputDirectory> -i <inputDirectory> -fileList pdfs_random_50000.csv
- 
+ === Some notes ===
+ 
+ *The watchdog process will restart the child process unless the child process exits with a "do not restart value"=666.
+ If you want to kill all processing, make sure to kill the parent process and then the child process.
+ 
+ *Because of a feature in javax's xml parser and the way the parser is configured in PDFBox 1.8.8, it is common to see error messages (e.g. Fatal Error :21:9: Element type "pdfx:" must be followed by either attribute specifications, ">" or "/>"). That should go away when Tika migrates to PDFBox 2.0.
+ 
  == TikaBatch Server ==
  Module not yet implemented...want to contribute?
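As an illustration of the "prepend a J" convention for child-process jvm args shown in the usage examples, here is a rough sketch in Python. It is not the tika-app code: the function name `split_child_jvm_args` is hypothetical, and the rule that a bare "-J" stays an application flag is an assumption drawn from the RecursiveParserWrapper example, where -J appears on its own.

```python
def split_child_jvm_args(args):
    """Hypothetical helper: separate J-prefixed jvm args from tika-app args."""
    jvm_args, app_args = [], []
    for arg in args:
        # Assumption: "-J" followed by more text is a child-process jvm arg,
        # e.g. -JXmx2g becomes -Xmx2g. A bare "-J" is left untouched.
        if arg.startswith("-J") and len(arg) > 2:
            jvm_args.append("-" + arg[2:])
        else:
            app_args.append(arg)
    return jvm_args, app_args
```

With this sketch, `-JXmx2g -JDlog4j.configuration=file:bin/log4j.xml -i in -o out` would yield the jvm args `-Xmx2g` and `-Dlog4j.configuration=file:bin/log4j.xml`, leaving `-i in -o out` for the batch process.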
