[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

Sam Stephens (Jira) Fri, 01 Apr 2022 14:32:05 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516141#comment-17516141
 ]


Sam Stephens commented on TIKA-3711:
------------------------------------

I guess the question is what are the semantics of this operation? When I ask 
for the text of a document, what does that actually mean?

As an end user, I'd argue that the most useful semantics are these: getting 
the text of a document should provide the closest possible representation of 
the text a user would read when viewing the document in its native format.

By this argument, the image filenames should not be there, because I wouldn't 
see image filenames if I were reading the Word document from within Word.

 

I don't think more information for its own sake is necessarily good. Taken to 
its extreme (a reductio ad absurdum), the same reasoning would say that adding 
text describing all document formatting is useful: the words "Heading 1" each 
time there's a heading, and "Bold" and "Unbold" around each bolded section. 
This is clearly more information, but it's also clear that adding it would 
rapidly make the text extracted from a Word document unusable.

 

From an end user perspective, I'm using this text extraction so I can put 
documents in a search index. Having the terms "image1", "image2", etc. show up 
in my index for documents that contain images is not useful behavior, unless 
those terms actually occur in the real text of the document.

The image filenames are metadata. If I wanted that metadata, I could work with 
the full XHTML representation of the document to get it. But my take is that 
BodyContentHandler should give me text, not metadata.
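
To make the text/metadata distinction concrete, here is a minimal sketch that 
uses only the JDK's SAX classes, not Tika's API. It keeps character data but 
skips anything inside an element marked as an embedded resource. The class 
name VisibleTextHandler, the class="embedded" marker and the sample markup are 
all assumptions for illustration; the XHTML that Tika actually emits for an 
embedded image may look different.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text a reader would see and drops text
// that sits inside elements flagged as embedded-resource metadata.
class VisibleTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // greater than 0 while inside a skipped element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: metadata elements carry class="embedded".
        if (skipDepth > 0 || "embedded".equals(atts.getValue("class"))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Parses an XHTML string and returns only the non-skipped text.
    static String extract(String xhtml) throws Exception {
        VisibleTextHandler handler = new VisibleTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }
}
```

With a handler along these lines, the image filename would stay available to 
anyone who reads the full XHTML, while the plain-text view would contain only 
what a reader sees in the document.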

 

> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
>                 Key: TIKA-3711
>                 URL: https://issues.apache.org/jira/browse/TIKA-3711
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Minor
>         Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
