[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515919#comment-17515919
 ] 

Tim Allison commented on TIKA-3711:
-----------------------------------

I introduced that change because some parsers were including the embedded file 
name in the extracted text and some were not.  So we had different behavior for 
different file types, which was less than ideal.

I included this bullet in the CHANGES.txt file as an alert to changed behavior:

bq.    * Improve consistency in reporting package-entry divs across all parsers 
for embedded files (TIKA-3644). This leads to some more text (embedded file 
names) in files with many embedded attachments.

We can change the behavior to "include the file name only in xhtml attributes", 
so that it does not show up in the extracted text.  But we should do that 
consistently for all file types.
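
To illustrate the difference, here is a minimal sketch using only the JDK's SAX 
parser, not Tika itself. The two XHTML snippets are assumptions about the shape 
of the output (the exact markup may differ), and the use of an "id" attribute 
to carry the name is hypothetical. A handler that collects only character data 
sees the file name in the first case and nothing in the second:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PackageEntryTextDemo {

    // Collects only character data, roughly what a text-only handler sees.
    // Attribute values are never passed to characters(), so they are skipped.
    static String extractText(String xhtml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed shape: file name emitted as element text.
        String nameAsText =
                "<div class=\"package-entry\"><h1>image1.png</h1></div>";
        // Hypothetical alternative: file name carried only in an attribute.
        String nameAsAttribute =
                "<div class=\"package-entry\" id=\"image1.png\"/>";

        System.out.println("[" + extractText(nameAsText) + "]");      // prints [image1.png]
        System.out.println("[" + extractText(nameAsAttribute) + "]"); // prints []
    }
}
```

With the attribute-only form, the name stays available to anyone consuming the 
xhtml, while text-only extraction of an image-only document comes back empty.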

Fellow devs, what do you think?

> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
>                 Key: TIKA-3711
>                 URL: https://issues.apache.org/jira/browse/TIKA-3711
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)