I should know the answer to this, but I don't.
java -jar tika.jar -J will recursively analyze all files contained in the
top-level file, and java -jar tika.jar --extract will extract the files
embedded directly in the top-level file, but not any files embedded in
those files (AFAIK). Is there a way to recursively extract all files (yea,
unto the nth degree)?

Our use case is to be able to send all image files to Google's AI/OCR
engine (which yields better results than tesseract), but process the
remaining textual files with Tika -J.
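
To make the use case concrete, here is a rough sketch of the routing step I
have in mind once the files are extracted. It is only an illustration: the
class name, the extension list, and the "OCR"/"TIKA" labels are all made up
for this example, and a real setup would probably route on the detected
MIME type rather than the file extension.

```java
import java.util.Locale;
import java.util.Set;

public class RouteByType {
    // Hypothetical list of extensions to treat as images (an assumption,
    // not anything Tika defines).
    static final Set<String> IMAGE_EXTENSIONS =
            Set.of("png", "jpg", "jpeg", "gif", "tif", "tiff", "bmp");

    // Returns true if the file name ends in one of the image extensions.
    static boolean isImage(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all, or a trailing dot
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return IMAGE_EXTENSIONS.contains(ext);
    }

    // Labels each extracted file with the destination it would be sent to.
    static String route(String fileName) {
        return isImage(fileName) ? "OCR" : "TIKA";
    }

    public static void main(String[] args) {
        // Each argument is the name of an already-extracted file.
        for (String name : args) {
            System.out.println(route(name) + ": " + name);
        }
    }
}
```

Files labeled OCR would go to the external OCR service and everything else
to Tika -J; the missing piece is still getting every nested file out in the
first place.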

Alternatively, is it possible to replace tesseract with a call to Google's
OCR engine (or some other OCR program)?

I'm guessing this might be possible with the new pipes feature, but I'm
really having trouble understanding the process with pipes.

Thanks.
