[Tika Wiki] Trivial Update of "TikaBatchUsage" by TimothyAllison

Apache Wiki Tue, 10 Mar 2015 18:58:41 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "TikaBatchUsage" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchUsage?action=diff&rev1=5&rev2=6

  = Usage =
  See TikaBatchOverview for a general design overview of tika-batch.
  
- This is all still very much in a dev state.  The code is currently available 
[[https://github.com/tballison/tika/tree/TIKA-1302|here]].  The current goal is 
to get this into decent enough shape to make it into Tika 1.8.
+ The code is currently available 
[[https://github.com/tballison/tika/tree/TIKA-1330|here]].  The current goal is 
to add this into Tika 1.8.
- 
  
  == TikaBatch FileSystem (FS) ==
  For expert users who don't want to use tika-app or who might want to do 
custom extensions, there are example driver files and logging config files 
available in 
[[https://github.com/tballison/tika/tree/TIKA-1302/tika-batch/src/main/examples|here]].
@@ -15, +14 @@

  You can see the commandline arguments via the regular "-?" or "--help" 
commands.  There is a separate section at the end for tika-batch options.
  
  In the current dev version, tika-app decides whether it is in batch mode based on 
one of two signals:
- 1. The final argument in the commandline args is a directory
+ 
+ 1. There are only two arguments and the first one is an existing directory
+ 
- 2. -srcDir is specified in the commandline
+ 2. -inputDir or -i is specified in the commandline
  
  Once the app knows that it is in batch mode, it converts some of the 
traditional tika-app commandline arguments for use by 
org.apache.tika.batch.fs.FSBatchProcessCLI.
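
For readers who want to see the two signals spelled out, here is a minimal Java sketch of the check as the revised list describes it. The class and method names (BatchModeSketch, isBatchMode) are hypothetical and are not taken from the tika-app source; the real implementation may differ in detail.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not the actual tika-app code.
final class BatchModeSketch {

    // Returns true if either of the two batch-mode signals is present.
    static boolean isBatchMode(String[] args) {
        // Signal 1: exactly two arguments and the first is an existing directory.
        if (args.length == 2 && Files.isDirectory(Paths.get(args[0]))) {
            return true;
        }
        // Signal 2: -inputDir or -i appears somewhere on the command line.
        for (String arg : args) {
            if (arg.equals("-inputDir") || arg.equals("-i")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isBatchMode(args));
    }
}
```

Under these assumptions, a call such as isBatchMode(new String[] {"-t", "file.pdf"}) returns false unless a directory named "-t" happens to exist in the working directory.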
  
@@ -24, +25 @@

  
   *Most basic (with output to a directory called "output"):
  
-       java -jar tika-app.X.Y.jar <inputDirectory>
+       java -jar tika-app.X.Y.jar <inputDirectory> <outputDirectory>
+ 
+  *Specify input and output directories:
+ 
+       java -jar tika-app.X.Y.jar -i /mydata/src/dir -o /mydata/output/dir
  
   *Set the number of file consumer threads:
  
-       java -jar tika-app.X.Y.jar -numConsumers 10 <inputDirectory>
+       java -jar tika-app.X.Y.jar -numConsumers 10 -i <inputDirectory> -o 
<outputDirectory>
  
-  *Specify input and output directories:
+  *Output text instead of XML:
  
-       java -jar tika-app.X.Y.jar -srcDir /mydata/src/dir -targDir 
/mydata/output/dir
+       java -jar tika-app.X.Y.jar -t -i <inputDirectory> -o <outputDirectory>
+ 
+  *Use the RecursiveParserWrapper and store text for each document:
+       java -jar tika-app.X.Y.jar -J -t -i <inputDirectory> -o 
<outputDirectory>
  
   *Specify jvm args to be used by the child process (prepend a "J" to the 
regular args):
  
-       java -jar tika-app.X.Y.jar -JXmx2g 
-JDlog4j.configuration={{file:bin/log4j.xml}} <inputDirectory>
+       java -jar tika-app.X.Y.jar -JXmx2g 
-JDlog4j.configuration={{file:bin/log4j.xml}} -i <inputDirectory> -o 
<outputDirectory>
  
   *Commandline to generate output files for tika-eval, processing only those 
files listed in pdfs_random_50000.csv:
-       java -Dlog4j.configuration=file:bin/log4j_driver.xml -jar 
tika-app-X.Y.jar -JXmx6g -JDlog4j.configuration=file:bin/log4j.xml -bc 
tika-batch-config-basic-test.xml -numConsumers 10 -targDir <targDir> -srcDir 
<srcDir> -fileList pdfs_random_50000.csv
+       java -Dlog4j.configuration=file:bin/log4j_driver.xml -jar 
tika-app-X.Y.jar -JXmx6g -JDlog4j.configuration=file:bin/log4j.xml -bc 
tika-batch-config-basic-test.xml -numConsumers 10 -o <outputDirectory> -i 
<inputDirectory> -fileList pdfs_random_50000.csv
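
The "prepend a J" convention in the jvm args example above can be illustrated with a short Java sketch. It assumes the parent simply strips the "J" to recover the regular JVM argument for the child, and that a bare "-J" (the RecursiveParserWrapper switch shown earlier) is not a JVM argument. The class and method names are hypothetical, and the actual tika-app handling may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the actual tika-app code.
final class ChildJvmArgsSketch {

    // Collects the J-prefixed arguments and converts them to regular JVM arguments.
    static List<String> childJvmArgs(String[] args) {
        List<String> jvmArgs = new ArrayList<>();
        for (String arg : args) {
            // "-JXmx2g" becomes "-Xmx2g"; a bare "-J" is skipped (assumption).
            if (arg.startsWith("-J") && arg.length() > 2) {
                jvmArgs.add("-" + arg.substring(2));
            }
        }
        return jvmArgs;
    }

    public static void main(String[] args) {
        System.out.println(childJvmArgs(args));
    }
}
```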
- 
  
        
+ === Some notes ===
+ 
+  *The watchdog process will restart the child process unless the child 
process exits with the "do not restart" value of 666.  To stop all 
processing, kill the parent process first and then the child process.
+ 
+  *Because of a feature in javax's xml parser and the way the parser is 
configured in PDFBox 1.8.8, it is common to see error messages (e.g. Fatal 
Error :21:9: Element type "pdfx:" must be followed by either attribute 
specifications, ">" or "/>").  That should go away when Tika migrates to PDFBox 
2.0.
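
The restart behaviour described in the watchdog note can be sketched in Java as a simple supervising loop. This is only an illustration under stated assumptions: the class name, the IntSupplier abstraction for "run the child once and return its exit value", and the maxStarts safety cap are all hypothetical and do not come from the tika-batch source; the value 666 is the one quoted in the note.

```java
import java.util.function.IntSupplier;

// Hypothetical watchdog loop for illustration only; not the actual tika-batch code.
final class WatchdogSketch {

    // "Do not restart" value as quoted in the note.
    static final int DO_NOT_RESTART_EXIT_VALUE = 666;

    // Starts the child repeatedly until it reports the "do not restart" value
    // or the (hypothetical) maxStarts cap is reached. Returns the number of starts.
    static int superviseChild(IntSupplier runChildOnce, int maxStarts) {
        int starts = 0;
        while (starts < maxStarts) {
            starts++;
            int exitValue = runChildOnce.getAsInt();
            if (exitValue == DO_NOT_RESTART_EXIT_VALUE) {
                break; // the child asked not to be restarted
            }
            // any other exit value: loop around and start the child again
        }
        return starts;
    }

    public static void main(String[] args) {
        // Demo with a fake child that always asks not to be restarted.
        System.out.println(superviseChild(() -> DO_NOT_RESTART_EXIT_VALUE, 10));
    }
}
```

In a real driver the supplier would launch and wait for the child JVM. Because a loop like this keeps restarting the child, stopping the child alone is not enough, which is why the parent has to be killed first.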
+ 
  
  == TikaBatch Server ==
  Module not yet implemented...want to contribute?
