Re: How to parse PDF files effectively with Tika

Sergey Beryozkin Mon, 12 Sep 2016 14:20:45 -0700

Hi Tim
This is very helpful, thanks.
I'll experiment with the code below.

By the way, I've found out that AutoDetectParser may not work if the (PDF) stream is an attachment stream, which may not support mark/reset.
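
To illustrate the mark/reset problem independently of Tika, here is a minimal JDK-only sketch. It is not Tika code: MarkSafePeek, withoutMark, ensureMarkSupported and looksLikePdf are hypothetical names made up for this illustration, and the "%PDF-" check only stands in for whatever detection logic needs to peek at the stream. The idea is simply that a stream which cannot rewind is wrapped in a BufferedInputStream before anything reads ahead in it.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MarkSafePeek {

    // Simulates an attachment stream that cannot rewind (hypothetical helper).
    public static InputStream withoutMark(InputStream in) {
        return new FilterInputStream(in) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readLimit) { /* no-op */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    // Returns the stream as-is if it can rewind, otherwise wraps it in a buffer.
    public static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Peeks at the leading bytes, then rewinds so a later parse sees the whole stream.
    // The caller must pass a stream that supports mark/reset.
    public static boolean looksLikePdf(InputStream in) throws IOException {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[magic.length];
        in.mark(magic.length);
        try {
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // stream ended before the full prefix was read
                }
                n += r;
            }
            return n == magic.length && Arrays.equals(head, magic);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        InputStream attachment = withoutMark(new ByteArrayInputStream(data));
        InputStream safe = ensureMarkSupported(attachment);
        System.out.println("PDF prefix found: " + looksLikePdf(safe));
        System.out.println("first byte after peek: " + (char) safe.read());
    }
}
```

The wrapped stream, not the original one, is what would then be handed to the parser. Tika's own TikaInputStream.get(InputStream) is meant to serve a similar purpose, though whether it resolves the attachment-stream case described here is not confirmed in this thread.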

I've been wondering, would it make sense to pass a MediaType identifying the data format as either a ParseContext or Metadata property for AutoDetectParser to avoid trying to read the stream? My demo works with PDF & ODT files, and before a parse call I already know the media type.


Thanks, Sergey
On 12/09/16 14:26, Allison, Timothy B. wrote:

Hi Sergey,

> Is this code good enough to get all the content (and metadata) out of a
> 'simple' PDF ?

Yes, but...

> For example, Tim has mentioned that it is possible to handle embedded PDF
> attachments - I don't even know what they are, to me every PDF is just a text
> when I look at it :-).


PDFs can have regular attachments (.doc, .ppt, etc., even other PDFs).  There are 
two traditional ways to get the content of embedded files inlined in the XHTML output:

Option 1 (for the 3 param call to parse):
Parser parser = new AutoDetectParser();
ToTextContentHandler contentHandler = new ToTextContentHandler();
Metadata m = new Metadata();
parser.parse(pdfInputStream, contentHandler, m); //3 param parse

Option 2 (for the 4 param call to parse):
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, parser); //NEED TO ADD PARSER FOR EMBEDDED DOCS
ToTextContentHandler contentHandler = new ToTextContentHandler();
Metadata m = new Metadata();
parser.parse(pdfInputStream, contentHandler, m, context); //4 param call to parse

Another option is to use the RecursiveParserWrapper.  This returns a 
List<Metadata>, where the first Metadata object represents the container 
document, and the subsequent Metadata objects represent embedded documents.  The text 
content for each document is stored in the RecursiveParserWrapper.TIKA_CONTENT field 
within each Metadata object.

Option 3

        Parser p = new AutoDetectParser();
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p,
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
        ParseContext context = new ParseContext();
        try (InputStream is = getResourceAsStream("/test-documents/" + filePath)) {
            wrapper.parse(is, new DefaultHandler(), new Metadata(), context);
        }
        return wrapper.getMetadata(); // List<Metadata>, container document first

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, September 9, 2016 10:06 AM
To: user@tika.apache.org
Subject: How to parse PDF files effectively with Tika

Hi All

While experimenting with some simple demo code that creates a Tika 
PDFParser (and a few other parsers) and provides a ToTextContentHandler for it to 
return the content, I've realized I'm not really sure what the best 
strategy is.

For example, Tim has mentioned that it is possible to handle embedded PDF 
attachments - I don't even know what they are, to me every PDF is just a text 
when I look at it :-). Besides, I'm not sure whether ToTextContentHandler might be 
missing some content.

Here is the basic code I have:

PDFParser parser = new PDFParser();
Metadata m = new Metadata();
ParseContext context = new ParseContext();
ToTextContentHandler contentHandler = new ToTextContentHandler();
parser.parse(pdfInputStream, contentHandler, m, context);

// work with the returned content and the filled-in Metadata
String content = contentHandler.toString();

Is this code good enough to get all the content (and metadata) out of a 
'simple' PDF ?

How can I enhance this code to handle embedded attachments too?
Ideally it would keep supporting both 'simple' and 'complex' PDFs.

I'd like to understand it better so that I can enhance our CXF Tika integration 
code a bit.

Thanks, Sergey
