[ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeremy Anderson updated TIKA-704:
---------------------------------

    Attachment: LicensedTestWithPdf.docx
                LicensedTestWithOutlook.docx

These are licensed versions... No Yamaha manual this time.

> PDF and Outlook docs embedded in MS Word documents not parsed
> -------------------------------------------------------------
>
>                 Key: TIKA-704
>                 URL: https://issues.apache.org/jira/browse/TIKA-704
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>         Environment: Windows 7 64-bit
>            Reporter: Jeremy Anderson
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>         Attachments: LicensedTestWithOutlook.docx, LicensedTestWithPdf.docx,
> TestWithOutlook.docx, TestWithPdf.docx, recursiveUsage.txt
>
>
> Currently there appear to be issues with embedded PDFs and Outlook .msg files
> contained in MS Word documents. I'll attach a sample for each and my
> recursive parser (in case the problem lies in there).
> From what I see, when these embedded objects are parsed, they're initially
> identified as vnd.openxmlformats-officedocument.oleObject in the metadata's
> Content-Type field. After the recursive parser calls its superclass's parse
> method, the Content-Types update to the following:
> PDFs: application/vnd.ms-works
> .MSG: application/x-tika-msoffice
> The internal AutoDetectParser is unable to properly identify these PDFs and
> therefore does not call the PDFParser on them.
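
One possible reading of the Content-Type values reported above is that the embedded PDF sits inside an OLE2 wrapper, so a detector that only looks at the leading bytes sees the container signature and not `%PDF-`. The sketch below illustrates that effect with the standard library only. It is not Tika's detection code: the class `EmbeddedTypeSniffer`, its methods, and the fake wrapper are hypothetical names made up for illustration, and the wrapper is just the OLE2 signature plus padding, not a valid compound document.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Tika.
public class EmbeddedTypeSniffer {
    // Signature found at the start of OLE2 compound documents.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Signature found at the start of PDF files.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns the first offset of pattern in data, or -1 if it is absent.
    static int indexOf(byte[] data, byte[] pattern) {
        outer:
        for (int i = 0; i + pattern.length <= data.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Prefix-only detection: looks at the leading bytes and nothing else.
    static String sniffPrefix(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(data, OLE2_MAGIC)) {
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }

    // Assumption: stands in for an OLE wrapper. NOT a valid OLE2 file,
    // only the signature, 16 zero bytes of padding, then the payload.
    static byte[] fakeOleWrap(byte[] payload) {
        byte[] out = new byte[OLE2_MAGIC.length + 16 + payload.length];
        System.arraycopy(OLE2_MAGIC, 0, out, 0, OLE2_MAGIC.length);
        System.arraycopy(payload, 0, out, OLE2_MAGIC.length + 16, payload.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] wrapped = fakeOleWrap(pdf);
        System.out.println("bare PDF:    " + sniffPrefix(pdf));
        System.out.println("wrapped PDF: " + sniffPrefix(wrapped));
        System.out.println("PDF offset inside wrapper: " + indexOf(wrapped, PDF_MAGIC));
    }
}
```

In this sketch the wrapped payload is reported as the container type, and the PDF signature only turns up once the bytes past the wrapper header are inspected. Whether that is what actually happens in the attached documents would need to be confirmed against the real files.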