I have a PDF document with a docx attachment. I wasn't having luck getting the contents of the docx with tika.parseToString(file).
I dug around a bit in the PDFExtractor and found that when I changed this line:

    embeddedExtractor.parseEmbedded(
        stream,
        new EmbeddedContentHandler(new BodyContentHandler(localHandler)),
        metadata, false);

to:

    embeddedExtractor.parseEmbedded(
        stream,
        new EmbeddedContentHandler(handler),
        metadata, false);

in other words, when I no longer required "body" elements, I was able to get the content of the attached document.

I attached the same inner document to a docx file, and its content was extracted without this change.

Does anyone know why this change is required in PDFExtractor? Is this a bad solution?

Unfortunately, I can't share the documents.

Best,
Tim
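
For readers unfamiliar with what requiring "body" elements means: Tika's BodyContentHandler is documented as passing on only what sits inside the XHTML body element. The sketch below is not Tika's code and does not reproduce the PDFExtractor behaviour; it is a minimal illustration using only the JDK's SAX classes, and it ignores XML namespaces for brevity. The names BodyFilterSketch, TextCollector, and extract are hypothetical. It shows that a body-only filter yields nothing when the incoming events carry no body element, which may or may not be what happens in the case described above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyFilterSketch {

    /** Collects character data; when bodyOnly is true, only text inside a body element. */
    static class TextCollector extends DefaultHandler {
        private final boolean bodyOnly;
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // > 0 while the parser is inside <body>

        TextCollector(boolean bodyOnly) {
            this.bodyOnly = bodyOnly;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text outside <body> is dropped when the filter is active.
            if (!bodyOnly || bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Parses the XML string and returns the text the collector let through. */
    static String extract(String xml, boolean bodyOnly) throws Exception {
        TextCollector collector = new TextCollector(bodyOnly);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        return collector.text();
    }

    public static void main(String[] args) throws Exception {
        String withBody = "<html><body><p>attached text</p></body></html>";
        String withoutBody = "<div><p>attached text</p></div>";

        System.out.println("[" + extract(withBody, true) + "]");     // body present, filter on
        System.out.println("[" + extract(withoutBody, true) + "]");  // no body, filter on
        System.out.println("[" + extract(withoutBody, false) + "]"); // no body, filter off
    }
}
```

With the filter on, the second input produces an empty string; with the filter off, the same input produces its text.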