RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

Allison, Timothy B. Mon, 02 May 2016 05:23:49 -0700

The command line I gave you writes JSON files, not archives, so an archive 
tool will not be able to open them.  If you open them in a text/JSON editor, 
you should see valid data.  If they're corrupt, please let us know!


If you're able to process JSON files, you should be good to go.  Otherwise, the 
recommendation to use Java's ZipFile API and do the unzipping yourself is 
probably the best option.  
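
As a rough sketch of that do-it-yourself route (not something taken from Tika 
or PDFBox), the code below uses only java.util.zip and java.nio.file to copy 
every .pdf entry out of a zip while keeping the folder layout stored in the 
archive.  The class name ZipPdfUnpacker and the method unpackPdfs are made up 
for illustration.  It does not look inside zips nested within the zip, and it 
leaves the actual text extraction to whatever tool you run on the unpacked 
files afterwards.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: the class and method names are illustrative only.
final class ZipPdfUnpacker {

    /**
     * Copies every *.pdf entry of zipPath into outDir, keeping the folder
     * layout recorded in the archive. Returns the paths that were written.
     */
    static List<Path> unpackPdfs(Path zipPath, Path outDir) throws IOException {
        List<Path> written = new ArrayList<>();
        Path root = outDir.toAbsolutePath().normalize();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                // Skip folders and anything that is not a PDF.
                if (entry.isDirectory() || !name.toLowerCase().endsWith(".pdf")) {
                    continue;
                }
                // Refuse entries such as "../x.pdf" that would land outside outDir.
                Path target = root.resolve(name).normalize();
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes output directory: " + name);
                }
                Files.createDirectories(target.getParent());
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
                written.add(target);
            }
        }
        return written;
    }

    // Usage: java ZipPdfUnpacker <archive.zip> <output-dir>
    public static void main(String[] args) throws IOException {
        for (Path p : unpackPdfs(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(p);
        }
    }
}
```

Once the PDFs are on disk in their original tree, you can point your text 
extraction at the output directory.  Handling zips within zips would need an 
extra recursive step, which this sketch leaves out.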

In Tika, we do have a -z option to extract embedded files, but that only 
extracts the first level of documents and it doesn't reproduce the original 
file structure. If you have zips within zips, you won't get the content.

 
-----Original Message-----
From: davidgreen.co...@gmail.com [mailto:davidgreen.co...@gmail.com] On Behalf 
Of David Green
Sent: Saturday, April 30, 2016 9:07 PM
To: us...@pdfbox.apache.org
Subject: Re: is it possible to batch extract text from pdf files within a tree 
of folders within a zip file ?

Sorry for using the wrong forum.
Is there a Tika forum?

Your suggested command is working, after a fashion:

java -jar c:\jars\tika-app-1.12.jar -J -t -i f: -o g:

The directory structure is being reproduced, but the zip files are being 
copied as zip files (I think).  The copied files retain the original filename 
(including the original zip extension) with an additional json extension, 
though when I try to open a file using B1 file archiver, it reports a corrupt 
file.
