The command line I gave you writes JSON files, not zip archives, so an archive tool will report them as corrupt. If you open them in a text or JSON editor, you should see valid data. If they're corrupt there, please let us know!
If you're able to process JSON files, you should be good to go. Otherwise, the recommendation to use Java's ZipFile API and do the unzipping yourself is probably the best option. In Tika, we do have a -z option to extract embedded files, but it only extracts the first level of embedded documents and doesn't reproduce the original file structure. If you have zips within zips, you won't get the content of the inner zips.

-----Original Message-----
From: davidgreen.co...@gmail.com [mailto:davidgreen.co...@gmail.com] On Behalf Of David Green
Sent: Saturday, April 30, 2016 9:07 PM
To: us...@pdfbox.apache.org
Subject: Re: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

Sorry for using the wrong forum. Is there a Tika forum?

Your suggested command is working, after a fashion:

java -jar c:\jars\tika-app-1.12.jar -J -t -i f: -o g:

The directory structure is being reproduced, but the zip files are being copied as zip files (I think). The copied files retain the original filename (including the original zip extension) with an additional json extension, though when I try to open a file using B1 file archiver, it reports a corrupt file.
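
For the do-it-yourself unzipping route suggested in the reply, here is a rough sketch of how zips within zips might be walked using only the JDK. It is an illustration under assumptions, not something from Tika or PDFBox: the class name NestedZipWalker, the EntryHandler callback, and the "!/" separator for nested paths are all made up for this example. It uses java.util.zip.ZipInputStream rather than ZipFile, because ZipFile needs a file on disk and the inner archives here are only available as streams.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika or PDFBox.
public class NestedZipWalker {

    /** Hypothetical callback: receives every non-zip entry and its logical path. */
    public interface EntryHandler {
        void handle(String logicalPath, InputStream content) throws IOException;
    }

    /** Wraps a stream so that close() on the wrapper leaves the parent open. */
    private static InputStream shield(InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public void close() {
                // Intentionally empty: the caller owns the underlying stream.
            }
        };
    }

    /**
     * Walks a zip stream and recurses into entries whose name ends in ".zip".
     * Nested paths are reported as "outer.zip!/inner/file.pdf" (an arbitrary
     * convention chosen for this sketch).
     */
    public static void walk(InputStream in, String prefix, EntryHandler handler)
            throws IOException {
        try (ZipInputStream zis = new ZipInputStream(shield(in))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String path = prefix + entry.getName();
                if (path.toLowerCase(Locale.ROOT).endsWith(".zip")) {
                    // The current entry's bytes are themselves a zip archive.
                    walk(zis, path + "!/", handler);
                } else {
                    handler.handle(path, shield(zis));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java NestedZipWalker some-archive.zip
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            walk(in, "", (path, content) -> {
                if (path.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                    // This is where the stream would be handed to a text extractor.
                    System.out.println(path);
                }
            });
        }
    }
}
```

As written, main only prints the logical path of each PDF it finds, at any nesting depth. The handler is the place to pass the entry stream to whatever does the text extraction and to write the result under a matching output directory.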