[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-05-17 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538528#comment-17538528
 ] 

Sam Stephens commented on TIKA-3711:


Thanks [~tallison], confirmed this is working for me as expected.

> Image file names included in parsed Word Document text
> --
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: word-doc-with-image-from-word-365.docx, 
> word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-08 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519857#comment-17519857
 ] 

Luís Filipe Nassif commented on TIKA-3711:
--

Great! Thank you [~tallison] for all your hard work with Tika. And sorry if my 
comments made things harder to implement.



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519802#comment-17519802
 ] 

Hudson commented on TIKA-3711:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #512 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/512/])
TIKA-3711 -- allow configuration of EmbeddedDocumentExtractors via 
tika-config.xml (tallison: 
[https://github.com/apache/tika/commit/ccc7bd841e097c3aa6d0c7c8494ddc5fa7596619])
* (edit) 
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/AutoDetectParserConfig.java
* (add) 
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractorFactory.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java
* (edit) 
tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/configs/tika-config-with-names.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/configs/tika-config-no-names.xml
* (add) 
tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentExtractorFactory.java
TIKA-3711 -- allow configuration of EmbeddedDocumentExtractors via 
tika-config.xml -- review and correct places where outputHtml should be false. 
(tallison: 
[https://github.com/apache/tika/commit/6552b076f0b4987423710b72b8917150422ea112])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/pst/OutlookPSTParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/OneNoteTreeWalker.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/xml/WordMLParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/JackcessExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/microsoft/XML2003ParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/ImageGraphicsEngine.java
* (edit) CHANGES.txt
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519776#comment-17519776
 ] 

Tim Allison commented on TIKA-3711:
---

[~samstephens] and all Tika users, I'm sorry for the pain.  

The added configurability with the parsing embedded document extractor opens up 
some potential areas for useful new capabilities.



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519774#comment-17519774
 ] 

Tim Allison commented on TIKA-3711:
---

In reviewing the commit above, there were quite a few places where outputHtml 
should have been false.  I've fixed those. 

The general need remained, though: to allow users to turn off the 
reporting of embedded file names in the handler's content.  I've now made the 
EmbeddedDocumentExtractor configurable from tika-config.xml.

This is an example of how to do that:
{code}

  

  
  

  
false
  

  

{code}
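
The markup in the example above did not survive in this archive; only the value "false" remains. As a hedged sketch of what such a config might look like, the Java snippet below embeds a candidate tika-config.xml as a string and checks only that it is well-formed and carries the setting. The class names ParsingEmbeddedDocumentExtractorFactory and AutoDetectParserConfig come from the commit listing in this thread; the element layout and the parameter name writeFileNameToContent are assumptions, not a copy of the original comment, and should be checked against the Tika release in use. The snippet uses only the JDK and does not invoke Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ConfigSketch {

    // ASSUMPTION: element names and the writeFileNameToContent parameter are
    // reconstructed for illustration; they are not taken from the comment above.
    static final String CONFIG =
          "<properties>\n"
        + "  <autoDetectParserConfig>\n"
        + "    <embeddedDocumentExtractorFactory\n"
        + "        class=\"org.apache.tika.extractor.ParsingEmbeddedDocumentExtractorFactory\">\n"
        + "      <params>\n"
        + "        <writeFileNameToContent>false</writeFileNameToContent>\n"
        + "      </params>\n"
        + "    </embeddedDocumentExtractorFactory>\n"
        + "  </autoDetectParserConfig>\n"
        + "</properties>\n";

    /** Returns the trimmed text of the first element with this name, or null if absent. */
    public static String setting(String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(CONFIG.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName(name);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Parsing succeeds only if the sketch is well-formed XML.
        System.out.println("writeFileNameToContent = " + setting("writeFileNameToContent"));
    }
}
```

If the names hold for your Tika version, the same XML saved as tika-config.xml would be the file passed to Tika at startup; the well-formedness check here says nothing about whether Tika accepts it.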




[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519749#comment-17519749
 ] 

Tim Allison commented on TIKA-3711:
---

I'm going to address this in two commits. The first will add configurability 
for writing the file names to streams.  The second commit will be a review of 
the offending commit: 
https://github.com/apache/tika/commit/118734a1317fa13ad66959fdc28969ca50a49643 
-- I need to review cases where the calling parser has already written xhtml 
tags.



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517385#comment-17517385
 ] 

Tim Allison commented on TIKA-3711:
---

Y, we need to make this configurable.  I'll have time to work on this later 
this week.



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-04 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517167#comment-17517167
 ] 

Sam Stephens commented on TIKA-3711:


Regarding filenames, I don't think they will ever be semantically meaningful. I 
just created a document with Word 365 (uploaded as 
word-doc-with-image-from-word-365.docx), added a picture with the filename 
test-image.png, and the extracted filename is still image1.png. I think Word is 
creating non-interesting filenames.

As far as breaking other users, I'm raising this bug because this *is* a change 
in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where 
images are not part of the text.

Using the ToXMLContentHandler to look at the actual generated HTML, I think 
that's also going to be surprising behavior for most users.

{{}}
{{image.png}}
{{}}

My image actually has alt text; that's not included. And I think including the 
image file name as a header in the markup is going to be surprising to almost 
every user. It certainly doesn't match the source document (which has no 
headers, or visible text of any kind).

As an end user, what I'd like is the XHTML to be

{{}}
{{}}

And the text from BodyContentHandler to not include the image at all. That way 
the text is the text, and if I have an interest in Image alt tags, I can 
operate on the XHTML.

If you wanted to include an option to provide text for the image, I don't think 
image filenames will ever be useful from Word; alt text is the right place 
semantically to be looking for a textual representation of an image.
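
The difference between text content and attributes can be sketched with the JDK's SAX parser alone. This is not Tika code: the handler below only mimics a text-only handler by collecting character events, and both markup strings are hypothetical, since the XHTML quoted above lost its tags in this archive.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextOnlySketch {

    /** Collects character events only, ignoring all elements and attributes. */
    public static String textOf(String xhtml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical markup: the file name is written as element text.
        String nameAsText =
            "<div><h1>image1.png</h1><img src=\"embedded:image1.png\"/></div>";
        // Hypothetical markup: file name and alt text live only in attributes.
        String nameAsAttribute =
            "<div><img src=\"embedded:image1.png\" alt=\"Company logo\"/></div>";

        System.out.println("[" + textOf(nameAsText) + "]");      // prints [image1.png]
        System.out.println("[" + textOf(nameAsAttribute) + "]"); // prints []
    }
}
```

With the file name in an attribute, a consumer that wants it can still read the markup, while a text-only consumer such as a search indexer sees nothing extra.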



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-03 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516459#comment-17516459
 ] 

Nick Burch commented on TIKA-3711:
--

I'd lean towards putting the file name as an attribute of the img tag, along 
with the description as the alt text if the format supports it



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-02 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516385#comment-17516385
 ] 

Luís Filipe Nassif commented on TIKA-3711:
--

Well, when reading the document in its native format, users will see the 
embedded images, but of course it's not text. Regarding filenames, "image1", 
"image2" may not be useful, but "bank transfer receipt", "qrcode for payment" 
may be very useful...

 

> I don't think more information for its own sake is necessarily good

It's not, that's why I said this is use case specific...

 

In our project we have had our own content handler to output embedded 
filenames for a long time, so this change wouldn't affect us. But my point 
about suppressing the current (intended) output, even if that makes sense, is 
that it could break other users, so my weak suggestion is to have an option to 
enable the previous behavior.



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-01 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516141#comment-17516141
 ] 

Sam Stephens commented on TIKA-3711:


I guess the question is what are the semantics of this operation? When I ask 
for the text of a document, what does that actually mean?

As an end user, I'd argue the most useful semantics are that getting the text 
of a document provides the closest possible representation of the text a user 
would read when reading the document in its native format.

By this argument, the image filenames should not be there, because I wouldn't 
see image filenames if I was reading the Word document from within Word.

 

I don't think more information for its own sake is necessarily good. If I argue 
this from a reductio ad absurdum perspective, I'd then say that adding text 
describing all document formatting is useful. Adding the words "Heading 1" each 
time there's a heading, "Bold" and "Unbold" each time a bolded section occurs. 
This is clearly more information, but it's also clear that adding this 
information would rapidly make the text you extracted from a Word document 
unusable.

 

From an end user perspective, I'm using this text extraction so I can put 
documents in a search index. Having the terms "image1", "image2" etc show up 
in my index for documents that contain images is not useful behavior, unless 
that actually occurs in the real text of the document.

The image filenames are metadata. If I wanted that metadata, I can engage with 
the full XHTML representation of the document to get it. But my take is that 
BodyContentHandler should give me text, not metadata.

 



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516029#comment-17516029
 ] 

Tim Allison commented on TIKA-3711:
---

Yes, I totally hear you [~lfcnassif].  I think we have areas for improvement in 
configuring the handlers.  We currently have no way to specify which 
ContentHandler is selected in tika-config.xml.  We have no way to specify 
configurations for embedded document parsers (aside from the write limits etc.).



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-01 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515974#comment-17515974
 ] 

Luís Filipe Nassif commented on TIKA-3711:
--

Well, IMHO the user may be interested in checking the image in the 
original document if he/she knows there is some (non-OCRed) image in a specific 
text location. This could certainly be done with a specific XML handler 
implementation like you said, but if the current behavior is going to be 
changed to suppress the current text output, maybe an option to enable it 
again could help...



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515956#comment-17515956
 ] 

Tim Allison commented on TIKA-3711:
---

[~lfcnassif], it feels inelegant to write embedded filenames in the text.  
Would it not be enough to include that in the xhtml markup?



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-01 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515941#comment-17515941
 ] 

Luís Filipe Nassif commented on TIKA-3711:
--

I strongly prefer the current behavior, which returns more information to the 
user. Of course this is use case specific, and maybe an option could exist to 
disable it.



[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515919#comment-17515919
 ] 

Tim Allison commented on TIKA-3711:
---

I introduced that change because some parsers were including it and some were 
not.  So we had different behavior for different file types, which was less 
than ideal.

I included this bullet in the CHANGES.txt file as an alert to changed behavior:

bq.* Improve consistency in reporting package-entry divs across all parsers 
for embedded files (TIKA-3644). This leads to some more text (embedded file 
names) in files with many embedded attachments.

We can change the behavior to "include the file name only in xhtml attributes" 
which will not show up in text.  But we should do that consistently for all 
file types.

Fellow devs, what do you think?
