I should know the answer to this, but I don't. java -jar tika.jar -J recursively analyzes all files contained in the top-level file, and java -jar tika.jar --extract extracts the top-level embedded files, but not any files embedded in those files (AFAIK). Is there a way to recursively extract all files (yea, unto the nth degree)?
Our use case is to send all image files to Google's AI/OCR engine (which yields better results than Tesseract) and process the remaining textual files with Tika -J. Alternatively, is it possible to replace Tesseract with a call to Google's OCR engine (or some other OCR program)? I'm guessing this might be possible with the new pipes feature, but I'm really having trouble understanding how pipes work. Thanks.
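To make the intended split concrete, here is a rough sketch of the routing step, assuming the embedded files have already been extracted (somehow) into a directory tree. It is plain Python and does not call Tika or any OCR service; the function name split_by_type is made up for illustration, and it guesses the type from the file name only, where a real setup would presumably rely on Tika's own detection.

```python
import mimetypes
from pathlib import Path


def split_by_type(root):
    """Walk root recursively and return (images, others) as lists of Paths.

    Hypothetical helper: 'images' would go to the external OCR engine,
    'others' would be handed to Tika -J.
    """
    images, others = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue  # skip directories
        # Assumption: the extension is trustworthy enough for a first pass.
        mime, _ = mimetypes.guess_type(path.name)
        if mime is not None and mime.startswith("image/"):
            images.append(path)
        else:
            others.append(path)
    return images, others
```

Files with no recognizable extension land in the second list, so nothing is silently dropped.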
