[jira] [Commented] (TIKA-4307) Text in header not extracted for Microsoft Word doc file

Tim Allison (Jira) Thu, 02 Oct 2025 12:33:09 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024343#comment-18024343
 ]


Tim Allison commented on TIKA-4307:
-----------------------------------

I don't have enough knowledge of the OLE2 doc format to fix this in a 
reasonable amount of time. I pinged on the POI issue. We'll see if a fellow dev 
has time. I'm sorry I don't have a better answer for you.

> Text in header not extracted for Microsoft Word doc file
> --------------------------------------------------------
>
>                 Key: TIKA-4307
>                 URL: https://issues.apache.org/jira/browse/TIKA-4307
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.2
>            Reporter: August Valera
>            Priority: Major
>         Attachments: 560702J-2x-converted.doc, 560702J-converted.docx, 
> 560702J-full-output.txt, 560702J.doc, screenshot-1.png
>
>
> We have a Microsoft Word doc file with text in the header. That header text 
> is not successfully extracted alongside the file content, but converting the 
> file to a docx file results in successful extraction.
> Samples are attached, conversion done using cloudconvert.com.
>  * [^560702J.doc] Original doc file, missing content
>  * [^560702J-converted.docx] Converted to docx file, correct output
>  * [^560702J-2x-converted.doc] Docx file converted back to doc, again missing 
> content
> h3. Current Behavior
> Header text is omitted when extracting doc files. It is extracted correctly from docx files.
> h3. Expected Behavior
> doc and docx files with identical header content should produce 
> identical output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
