Hi All,
While experimenting with some simple demo code that creates a
Tika PDFParser (and a few other parsers) and gives it a
ToTextContentHandler to return the content, I've realized I'm not
really sure what the best strategy is.
For example, Tim has mentioned that it is possible to handle embedded
PDF attachments. I don't even know what they are; to me every PDF is
just text when I look at it :-). Besides, I'm not sure whether
ToTextContentHandler is missing some content.
Here is the basic code I have:
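import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.ToTextContentHandler;

// pdfInputStream is an InputStream opened on the PDF
PDFParser parser = new PDFParser();
Metadata m = new Metadata();
ParseContext context = new ParseContext();
ToTextContentHandler contentHandler = new ToTextContentHandler();
parser.parse(pdfInputStream, contentHandler, m, context);
String content = contentHandler.toString();
// work with the returned content, and filled-in Metadata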
Is this code good enough to get all the content (and metadata) out of a
'simple' PDF?
How can I enhance this code to handle the embedded attachments too?
Ideally it would continue to support both 'simple' and 'complex' PDFs.
I'd like to understand this better so that I can enhance our CXF Tika
integration code a bit.
Thanks, Sergey