Hi All

While experimenting with some simple demo code that creates a Tika PDFParser (and a few other parsers) and passes it a ToTextContentHandler to collect the content, I've realized I'm not really sure what the best strategy is.

For example, Tim has mentioned that it is possible to handle embedded PDF attachments. I don't even know what they are; to me every PDF is just text when I look at it :-). Besides, I'm not sure whether ToTextContentHandler might be missing some content.

Here is the basic code I have:

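import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.ToTextContentHandler;

// pdfInputStream is an InputStream over the PDF, opened elsewhere
PDFParser parser = new PDFParser();
Metadata m = new Metadata();
ParseContext context = new ParseContext();
ToTextContentHandler contentHandler = new ToTextContentHandler();
parser.parse(pdfInputStream, contentHandler, m, context);

String content = contentHandler.toString();
// work with the returned content and the populated Metadata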

Is this code good enough to get all the content (and metadata) out of a 'simple' PDF?

How can I enhance this code to handle embedded attachments too? Ideally it would keep supporting both 'simple' and 'complex' PDFs.

I'd like to understand this better so that I can enhance our CXF Tika integration code a bit.

Thanks, Sergey
