[ https://issues.apache.org/jira/browse/TIKA-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024343#comment-18024343 ]
Tim Allison commented on TIKA-4307:
-----------------------------------
I don't have enough knowledge of the OLE2 doc format to fix this in a
reasonable amount of time. I pinged on the POI issue. We'll see if a fellow dev
has time. I'm sorry I don't have a better answer for you.
> Text in header not extracted for Microsoft Word doc file
> --------------------------------------------------------
>
> Key: TIKA-4307
> URL: https://issues.apache.org/jira/browse/TIKA-4307
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: August Valera
> Priority: Major
> Attachments: 560702J-2x-converted.doc, 560702J-converted.docx,
> 560702J-full-output.txt, 560702J.doc, screenshot-1.png
>
>
> We have a Microsoft Word doc file with text in the header. That header text
> is not successfully extracted alongside the file content, but converting the
> file to a docx file results in successful extraction.
> Samples are attached; the conversions were done using cloudconvert.com.
> * [^560702J.doc] Original doc file, missing content
> * [^560702J-converted.docx] Converted to docx file, correct output
> * [^560702J-2x-converted.doc] Docx file converted back to doc, again missing
> content
> h3. Current Behavior
> Extraction from doc files omits the header text. Extraction from docx files
> includes the header text correctly.
> h3. Expected Behavior
> doc and docx files with identical content in the header should result in
> identical output.
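
The expected behavior above can be checked mechanically. Below is a minimal sketch, assuming the plain-text output for each attachment has already been produced (for example with tika-app's --text option). The class name ExtractionDiff, the method missingLines, and the sample strings are hypothetical and are not part of Tika or of the attachments. It lists the lines that appear in the docx output but not in the doc output.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ExtractionDiff {

    // Collapse whitespace so layout differences between the two outputs
    // are not reported as missing text.
    public static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // Return the non-empty lines of `reference` (e.g. the docx output)
    // that do not occur anywhere in `candidate` (e.g. the doc output).
    public static List<String> missingLines(String reference, String candidate) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : candidate.split("\\R")) {
            seen.add(normalize(line));
        }
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\\R")) {
            String n = normalize(line);
            if (!n.isEmpty() && !seen.contains(n)) {
                missing.add(n);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical sample text; the real outputs come from the attachments.
        String docxText = "Example header text\nBody paragraph one\n";
        String docText = "Body paragraph one\n";
        System.out.println(missingLines(docxText, docText));
    }
}
```

An empty result would mean every line of the docx output also appears in the doc output. The comparison is line-based, so text that the two parsers wrap differently may show up as a false difference.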
--
This message was sent by Atlassian Jira
(v8.20.10#820010)