Hi Tim
This is very helpful, thanks.
I'll experiment with the code below.
By the way, I've found out that AutoDetectParser may not work if the (PDF)
stream is an attachment stream, which may not support mark/reset.
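A minimal sketch of one possible workaround, using only java.io and no Tika classes: wrap the stream in a BufferedInputStream when it does not support mark/reset. The class name MarkSupport and the helper withoutMark (a stand-in for an attachment stream) are hypothetical, and I have not checked whether AutoDetectParser already does something similar internally.

```java
import java.io.BufferedInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MarkSupport {

    /** Returns the stream unchanged if it supports mark/reset, otherwise buffers it. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Hypothetical stand-in for an attachment stream that lacks mark/reset. */
    static InputStream withoutMark(InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public boolean markSupported() {
                return false;
            }

            @Override
            public void mark(int readLimit) {
                // intentionally a no-op: this stream cannot remember a position
            }

            @Override
            public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }
}
```

The wrapped stream can then be handed to whatever code needs to peek at the first bytes and rewind, for example ensureMarkSupported(attachmentStream).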
I've been wondering, would it make sense to pass a MediaType identifying
the data format as either a ParseContext or Metadata property, so that
AutoDetectParser does not have to read the stream to detect the type?
My demo works with PDF & ODT files, and before a parse call I already
know the media type.
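To illustrate what I mean, here is a rough sketch in plain Java (not Tika API): a caller-supplied type is trusted as-is, and the stream is only sniffed when no type is given. The class MediaTypeHint and the method resolveType are hypothetical names, and the magic-byte check is a simplified stand-in for real detection. It assumes Java 9+ for readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;

final class MediaTypeHint {

    /** Uses the declared type when present; sniffs the stream only as a fallback. */
    static String resolveType(InputStream in, String declaredType) throws IOException {
        if (declaredType != null && !declaredType.isEmpty()) {
            return declaredType; // the stream is never read, so mark/reset is not needed
        }
        if (!in.markSupported()) {
            throw new IOException("cannot sniff a stream without mark/reset support");
        }
        byte[] head = new byte[4];
        in.mark(head.length);
        int n = in.readNBytes(head, 0, head.length);
        in.reset(); // rewind so the parser still sees the whole document

        if (n == 4 && head[0] == '%' && head[1] == 'P' && head[2] == 'D' && head[3] == 'F') {
            return "application/pdf";
        }
        if (n >= 2 && head[0] == 'P' && head[1] == 'K') {
            return "application/zip"; // ODT is a ZIP container; two bytes cannot tell them apart
        }
        return "application/octet-stream";
    }
}
```

With a known type, a call such as resolveType(attachmentStream, "application/pdf") returns immediately without touching the stream, which is the behaviour I'm after.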
Thanks, Sergey
On 12/09/16 14:26, Allison, Timothy B. wrote:
Hi Sergey,
Is this code good enough to get all the content (and metadata) out of a
'simple' PDF ?
Yes, but...
For example, Tim has mentioned that it is possible to handle embedded PDF
attachments - I don't even know what they are, to me every PDF is just a text
when I look at it :-).
PDFs can have regular attachments (.doc, .ppt, etc., even other PDFs). There are
two traditional ways to get content from embedded files inlined in the XHTML:
Option 1 (for the 3 param call to parse):
// declared as AutoDetectParser: the Parser interface only has the 4 param parse
AutoDetectParser parser = new AutoDetectParser();
ToTextContentHandler contentHandler = new ToTextContentHandler();
Metadata m = new Metadata();
parser.parse(pdfInputStream, contentHandler, m); // 3 param parse
Option 2 (for the 4 param call to parse):
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, parser); // NEED TO ADD PARSER FOR EMBEDDED DOCS
ToTextContentHandler contentHandler = new ToTextContentHandler();
Metadata m = new Metadata();
parser.parse(pdfInputStream, contentHandler, m, context); // 4 param call to parse
Another option is to use the RecursiveParserWrapper. This returns a
List<Metadata>, where the first Metadata object represents the container
document, and the subsequent Metadata objects represent embedded documents. The text
content for each document is stored in the RecursiveParserWrapper.TIKA_CONTENT field
within each Metadata object.
Option 3:
Parser p = new AutoDetectParser();
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p,
    new BasicContentHandlerFactory(
        BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
ParseContext context = new ParseContext();
try (InputStream is = getResourceAsStream("/test-documents/" + filePath)) {
    wrapper.parse(is, new DefaultHandler(), new Metadata(), context);
}
return wrapper.getMetadata(); // List<Metadata>, container document first
-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, September 9, 2016 10:06 AM
To: user@tika.apache.org
Subject: How to parse PDF files effectively with Tika
Hi All
While experimenting with some simple demo code which creates a Tika
PDFParser (and a few other parsers) and provides a ToTextContentHandler for it to
return the content, I've realized I'm not really sure what the best
strategy is.
For example, Tim has mentioned that it is possible to handle embedded PDF
attachments - I don't even know what they are, to me every PDF is just a text
when I look at it :-). Besides, I'm not sure whether ToTextContentHandler is
missing some content.
Here is the basic code I have:
PDFParser parser = new PDFParser();
Metadata m = new Metadata();
ParseContext context = new ParseContext();
ToTextContentHandler contentHandler = new ToTextContentHandler();
parser.parse(pdfInputStream, contentHandler, m, context);
// work with the returned content and the filled-in Metadata
String content = contentHandler.toString();
Is this code good enough to get all the content (and metadata) out of a
'simple' PDF ?
How can this code be enhanced to handle the embedded attachments too?
Ideally such that it continues supporting both 'simple' and 'complex' PDFs.
I'd like to understand it better so that I can enhance our CXF Tika integration
code a bit.
Thanks, Sergey