[jira] [Commented] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

Jeremy Anderson (JIRA) Wed, 07 Sep 2011 06:10:42 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098920#comment-13098920
 ]


Jeremy Anderson commented on TIKA-704:
--------------------------------------

Thanks for the quick attention and the fix.  I didn't realize that these types 
of test files can be, or are desired to be, included... I'll either change the 
license on those already uploaded (MSG example) and/or re-upload (a better PDF 
example).

Thanks again!!!

> PDF and Outlook docs embedded in MS Word documents not parsed
> -------------------------------------------------------------
>
>                 Key: TIKA-704
>                 URL: https://issues.apache.org/jira/browse/TIKA-704
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>         Environment: Windows 7 64-bit
>            Reporter: Jeremy Anderson
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>         Attachments: TestWithOutlook.docx, TestWithPdf.docx, 
> recursiveUsage.txt
>
>
> Currently there appear to be issues with embedded PDFs and Outlook MSG files 
> contained in MS Word documents. I'll attach a sample for each and my 
> recursive parser (in case the problem lies in there).
> From what I see, when these embedded objects are parsed, they're initially 
> identified as vnd.openxmlformats-officedocument.oleObject in the metadata's 
> Content-Type field. After a call to the recursive parser's superclass parse 
> method, the Content-Types update to the following:
> PDFs: application/vnd.ms-works
> .MSG: application/x-tika-msoffice
> The internal AutoDetectParser is unable to properly identify these PDFs and 
> therefore does not call the PDFParser on them.
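
For illustration only, here is a minimal sketch of why a detector that looks 
only at the leading bytes of a stream can miss an embedded PDF. It assumes 
(the thread does not confirm this) that the embedded object is stored inside 
an OLE2 container, so the stream starts with the OLE2 signature rather than 
"%PDF-". The class and method names (MagicSniffer, sniff, startsWith) are 
hypothetical and are not part of Tika; no Tika API is used.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MagicSniffer {
    // Signature at the start of every OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // A bare PDF file starts with "%PDF-".
    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if data begins with the given prefix bytes.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(data, 0, prefix.length), prefix);
    }

    // Hypothetical leading-bytes-only detection; returns a media type string.
    static String sniff(byte[] header) {
        if (startsWith(header, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            // Generic OLE2 type name as reported in the issue description.
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] barePdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

        // Assumed layout: container signature first, PDF bytes somewhere after.
        ByteArrayOutputStream wrapped = new ByteArrayOutputStream();
        wrapped.write(OLE2_MAGIC, 0, OLE2_MAGIC.length);
        wrapped.write(barePdf, 0, barePdf.length);

        System.out.println(sniff(barePdf));               // application/pdf
        System.out.println(sniff(wrapped.toByteArray())); // application/x-tika-msoffice
    }
}
```

Under that assumption, the PDF bytes are only reachable after the container 
has been unpacked, which would be consistent with the PDFParser never being 
called on the embedded object.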

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
