[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516385#comment-17516385 ]
Luís Filipe Nassif commented on TIKA-3711:
------------------------------------------

Well, when reading the document in its native format, users will see the embedded images, but of course that's not text. Regarding filenames, "image1", "image2" may not be useful, but "bank transfer receipt", "qrcode for payment" may be very useful...

> I don't think more information for its own sake is necessarily good

It's not, that's why I said this is use case specific...

In our project we have had our own content handler to output embedded filenames for a long time, so this change wouldn't affect us. But my point is that suppressing the current (intended) output, even if doing so makes sense, could break other users, so my weak suggestion is to have an option to enable the previous behavior.

> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
>                 Key: TIKA-3711
>                 URL: https://issues.apache.org/jira/browse/TIKA-3711
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Minor
>         Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)