[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516141#comment-17516141 ]
Sam Stephens commented on TIKA-3711:
------------------------------------

I guess the question is: what are the semantics of this operation? When I ask for the text of a document, what does that actually mean?

As an end user, I'd argue that the most useful semantics are that getting the text of a document provides the closest possible representation of the text a user would read when reading the document in its native format. By this argument, the image filenames should not be there, because I wouldn't see image filenames if I were reading the Word document from within Word.

I don't think more information for its own sake is necessarily good. Taking that position to its reductio ad absurdum, I'd have to say that adding text describing all document formatting is useful: adding the words "Heading 1" each time there's a heading, and "Bold" and "Unbold" each time a bolded section occurs. This is clearly more information, but it's also clear that adding it would rapidly make the text you extracted from a Word document unusable.

From an end user perspective, I'm using this text extraction so I can put documents in a search index. Having the terms "image1", "image2", etc. show up in my index for documents that contain images is not useful behavior, unless those terms actually occur in the real text of the document.

The image filenames are metadata. If I wanted that metadata, I could engage with the full XHTML representation of the document to get it. But my take is that BodyContentHandler should give me text, not metadata.

> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Minor
> Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
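
The text-versus-metadata distinction argued for in the comment can be illustrated with a small sketch. It deliberately does not use Tika: it relies only on the JDK's SAX parser, and the XHTML fragment in `main` is a hand-written, hypothetical stand-in, not Tika's actual output for the attached document. The class and method names (`TextVersusMetadata`, `extractText`, `extractImageNames`) are made up for illustration. One handler collects only character data, which is roughly what a reader would see; the other pulls image names out of attributes, for callers who explicitly want that metadata.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class TextVersusMetadata {

    /** Collects character data only; element attributes never reach the output. */
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Collects the alt attribute of every img element (the "metadata" view). */
    static class ImageNameHandler extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default SAXParserFactory is not namespace-aware, so match on qName.
            if ("img".equals(qName)) {
                String alt = atts.getValue("alt");
                if (alt != null) {
                    names.add(alt);
                }
            }
        }
    }

    /** Feeds the given XHTML string through the JDK SAX parser into the handler. */
    private static void parse(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
    }

    static String extractText(String xhtml) {
        TextOnlyHandler handler = new TextOnlyHandler();
        parse(xhtml, handler);
        return handler.text.toString();
    }

    static List<String> extractImageNames(String xhtml) {
        ImageNameHandler handler = new ImageNameHandler();
        parse(xhtml, handler);
        return handler.names;
    }

    public static void main(String[] args) {
        // Hypothetical markup, assumed for illustration only.
        String sample = "<html><body><p>Quarterly report</p>"
                + "<img src=\"embedded:image1.png\" alt=\"image1.png\"/></body></html>";
        System.out.println("text:   " + extractText(sample));
        System.out.println("images: " + extractImageNames(sample));
    }
}
```

Under these assumptions, a search index fed from `extractText` would never see the term `image1.png`, while a caller who does want the filenames can still recover them from the structured representation.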