[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516385#comment-17516385
 ] 

Luís Filipe Nassif commented on TIKA-3711:
------------------------------------------

Well, when reading the document in its native format, users will see the 
embedded images, though of course they are not text. Regarding filenames, 
"image1" and "image2" may not be useful, but "bank transfer receipt" or 
"qrcode for payment" may be very useful...

 

> I don't think more information for its own sake is necessarily good

It's not; that's why I said this is use-case specific...

 

In our project we have long had our own content handler that outputs embedded 
filenames, so this change wouldn't affect us. But my point is that suppressing 
the current (intended) output, even if doing so makes sense, could break other 
users, so my weak suggestion is to add an option that restores the previous 
behavior.

> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
>                 Key: TIKA-3711
>                 URL: https://issues.apache.org/jira/browse/TIKA-3711
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Minor
>         Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
