How to parse PDF files effectively with Tika

Sergey Beryozkin Fri, 09 Sep 2016 07:06:25 -0700

Hi All

While experimenting with some simple demo code that creates a Tika PDFParser (and a few other parsers) and provides a ToTextContentHandler for it to return the content, I've realized I'm not really sure what the best strategy is.

For example, Tim has mentioned that it is possible to handle embedded PDF attachments. I don't even know what they are; to me every PDF is just text when I look at it :-). Besides, I'm not sure whether ToTextContentHandler might be missing some content.


Here is the basic code I have:

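import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.ToTextContentHandler;

// pdfInputStream is an InputStream over the PDF, opened elsewhere
PDFParser parser = new PDFParser();
Metadata m = new Metadata();
ParseContext context = new ParseContext();
// collects the text content of the SAX events emitted by the parser
ToTextContentHandler contentHandler = new ToTextContentHandler();
parser.parse(pdfInputStream, contentHandler, m, context);

String content = contentHandler.toString();
// work with the returned content, and filled-in Metadata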

Is this code good enough to get all the content (and metadata) out of a 'simple' PDF?

How can this code be enhanced to handle the embedded attachments too? Ideally in such a way that it continues to support both 'simple' and 'complex' PDFs.

I'd like to understand it better so that I can enhance our CXF Tika integration code a bit.


Thanks, Sergey
