I have a PDF document with a docx attachment. I wasn't having any luck getting
the contents of the docx with tika.parseToString(file).

I dug around a bit in the PDFExtractor and found that when I changed this line:
embeddedExtractor.parseEmbedded(
        stream,
        new EmbeddedContentHandler(new BodyContentHandler(localHandler)),
        metadata,
        false);
to:

embeddedExtractor.parseEmbedded(
        stream,
        new EmbeddedContentHandler(handler),
        metadata,
        false);

in other words, when I no longer required "body" elements, I was able to get 
the content of the attached document.
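
To illustrate the effect described above, here is a minimal, JDK-only sketch of a handler that keeps text only inside a "body" element, next to one that keeps everything. It is a simplified model and not Tika's BodyContentHandler; the class and method names (BodyFilterSketch, BodyOnlyHandler, AllTextHandler, emit) are hypothetical. Whether the embedded docx events really reach the handler without a matching body element inside PDFExtractor is an assumption, not something confirmed here.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class BodyFilterSketch {

    /** Hypothetical handler that collects every character event it receives. */
    static class AllTextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hypothetical stand-in for a body-only handler: keeps text only inside <body>. */
    static class BodyOnlyHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // > 0 while we are inside a <body> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(localName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }
    }

    /** Fires a tiny SAX event stream, optionally wrapped in a <body> element. */
    static void emit(ContentHandler h, boolean withBody) throws SAXException {
        Attributes none = new AttributesImpl();
        if (withBody) h.startElement("", "body", "body", none);
        h.startElement("", "p", "p", none);
        char[] chars = "attachment text".toCharArray();
        h.characters(chars, 0, chars.length);
        h.endElement("", "p", "p");
        if (withBody) h.endElement("", "body", "body");
    }

    public static String bodyOnly(boolean withBody) {
        BodyOnlyHandler h = new BodyOnlyHandler();
        try {
            emit(h, withBody);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return h.text.toString();
    }

    public static String allText(boolean withBody) {
        AllTextHandler h = new AllTextHandler();
        try {
            emit(h, withBody);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return h.text.toString();
    }

    public static void main(String[] args) {
        System.out.println("body-only, body present: [" + bodyOnly(true) + "]");
        System.out.println("body-only, body missing: [" + bodyOnly(false) + "]");
        System.out.println("all text,  body missing: [" + allText(false) + "]");
    }
}
```

In this model the body-only handler silently drops the text whenever the events arrive without an enclosing body element, while the pass-through handler keeps it. That matches the symptom above, but it does not show which wrapper in the real call chain removes or fails to match the body element.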

I attached the same inner document to a docx file, and extraction worked without
this change. Does anyone know why this change is required in PDFExtractor? Is
this a bad solution?

Unfortunately, I can't share the documents.

           Best,

               Tim
