[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538528#comment-17538528 ]

Sam Stephens commented on TIKA-3711:

Thanks [~tallison], confirmed this is working for me as expected.

> Image file names included in parsed Word Document text
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Major
> Fix For: 2.4.0
>
> Attachments: word-doc-with-image-from-word-365.docx, word-doc-with-image.docx
>
> The attached Word document includes nothing but a single image. Running it
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519857#comment-17519857 ]

Luís Filipe Nassif commented on TIKA-3711:

Great! Thank you [~tallison] for all your hard work with Tika. And sorry if my comments made things harder to implement.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519802#comment-17519802 ]

Hudson commented on TIKA-3711:

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #512 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/512/])

TIKA-3711 -- allow configuration of EmbeddedDocumentExtractors via tika-config.xml (tallison: [https://github.com/apache/tika/commit/ccc7bd841e097c3aa6d0c7c8494ddc5fa7596619])
* (edit) tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/AutoDetectParserConfig.java
* (add) tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractorFactory.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java
* (edit) tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/configs/tika-config-with-names.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/configs/tika-config-no-names.xml
* (add) tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentExtractorFactory.java

TIKA-3711 -- allow configuration of EmbeddedDocumentExtractors via tika-config.xml -- review and correct places where outputHtml should be false. (tallison: [https://github.com/apache/tika/commit/6552b076f0b4987423710b72b8917150422ea112])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/pst/OutlookPSTParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/onenote/OneNoteTreeWalker.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/xml/WordMLParser.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/JackcessExtractor.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/microsoft/XML2003ParserTest.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/ImageGraphicsEngine.java
* (edit) CHANGES.txt
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519776#comment-17519776 ]

Tim Allison commented on TIKA-3711:

[~samstephens] and all Tika users, I'm sorry for the pain. The added configurability with the parsing embedded document extractor opens up some potential areas for useful new capabilities.

> Image file names included in parsed Word Document text
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Major
> Attachments: word-doc-with-image-from-word-365.docx, word-doc-with-image.docx
>
> The attached Word document includes nothing but a single image. Running it
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519774#comment-17519774 ]

Tim Allison commented on TIKA-3711:

In reviewing the commit above, there were quite a few places where outputHtml should have been false. I've fixed those.

The general need still remained, though, to allow users to turn off the reporting of embedded file names in the handler's content. I've now made the EmbeddedDocumentExtractor configurable from tika-config.xml. This is an example of how to do that:

{code}
false
{code}
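The effect of such a switch can be pictured with a small sketch. It is not Tika's code: the class `EmbeddedNameSketch`, its methods, and the flag name `writeFileNameToContent` are assumptions made for illustration, and only the JDK's SAX classes are used. It shows an emitter that always writes the wrapper div for an embedded file, but writes the file name as character content only when the flag is on.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not Tika's implementation: emits the wrapper for one embedded file. */
class EmbeddedNameSketch {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Assumed flag name; it stands in for whatever setting tika-config.xml exposes.
    private final boolean writeFileNameToContent;

    EmbeddedNameSketch(boolean writeFileNameToContent) {
        this.writeFileNameToContent = writeFileNameToContent;
    }

    /** Writes a div with class "package-entry", plus an h1 holding the name only when the flag is on. */
    void emit(String name, ContentHandler handler) throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "class", "class", "CDATA", "package-entry");
        handler.startElement(XHTML, "div", "div", attrs);
        if (writeFileNameToContent && name != null && !name.isEmpty()) {
            handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
            handler.characters(name.toCharArray(), 0, name.length());
            handler.endElement(XHTML, "h1", "h1");
        }
        handler.endElement(XHTML, "div", "div");
    }

    /** Collects character events only, roughly what a text-only handler would receive. */
    static String textOf(String name, boolean writeName) {
        StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            new EmbeddedNameSketch(writeName).emit(name, collector);
        } catch (SAXException e) {
            throw new IllegalStateException("unexpected SAX failure", e);
        }
        return text.toString();
    }
}
```

With the flag off, `EmbeddedNameSketch.textOf("image1.png", false)` yields an empty string, which matches the 2.2.0-style result the reporter relied on; with the flag on, the name appears in the text.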
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519749#comment-17519749 ]

Tim Allison commented on TIKA-3711:

I'm going to address this in two commits. The first will add configurability for writing the file names to streams. The second commit will be a review of the offending commit: https://github.com/apache/tika/commit/118734a1317fa13ad66959fdc28969ca50a49643 -- I need to review cases where the calling parser has already written xhtml tags.
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517385#comment-17517385 ]

Tim Allison commented on TIKA-3711:

Y, we need to make this configurable. I'll have time to work on this later this week.

> Image file names included in parsed Word Document text
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Minor
> Attachments: word-doc-with-image-from-word-365.docx, word-doc-with-image.docx
>
> The attached Word document includes nothing but a single image. Running it
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517167#comment-17517167 ]

Sam Stephens commented on TIKA-3711:

Regarding filenames, I don't think they will ever be semantically meaningful. I just created a document with Word 365 (uploaded as word-doc-with-image-from-word-365.docx), added a picture with the filename test-image.png, and the extracted filename is still image1.png. I think Word is creating non-interesting filenames.

As far as breaking other users, I'm raising this bug because this *is* a change in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where images are not part of the text.

Using the ToXMLContentHandler to look at the actual generated HTML, I think that's also going to be surprising behavior for most users.

{{}}
{{image.png}}
{{}}

My image actually has alt text; that's not included. And I think including the image file name as a header in the markup is going to be surprising to almost every user. It certainly doesn't match the source document (which has no headers, or visible text of any kind).

As an end user, what I'd like is the XHTML to be

{{}}
{{}}

And the text from BodyContentHandler to not include the image at all. That way the text is the text, and if I have an interest in Image alt tags, I can operate on the XHTML.

If you wanted to include an option to provide text for the image, I don't think image filenames will ever be useful from Word; alt text is the right place semantically to be looking for a textual representation of an image.
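The separation argued for here (text is text, image details stay in the markup) can be sketched as a post-processing step over the XHTML. This is a hypothetical illustration using only JDK SAX classes, not part of Tika. It assumes the embedded file name arrives as `<div class="package-entry"><h1>name</h1></div>`; the thread mentions package-entry divs and a header, but the exact markup in the comment was lost, so that shape is an assumption, as are the names `VisibleTextSketch` and `visibleText`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects document text while skipping embedded-file wrapper divs. */
class VisibleTextSketch {

    static String visibleText(String xhtml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;      // current element nesting depth
            private int skipFrom = -1;  // depth of the div being skipped, or -1 when not skipping

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                depth++;
                // Assumed marker: class="package-entry" on the wrapper div.
                if (skipFrom < 0 && "div".equals(qName)
                        && "package-entry".equals(atts.getValue("class"))) {
                    skipFrom = depth;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (depth == skipFrom) {
                    skipFrom = -1; // leaving the skipped div
                }
                depth--;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (skipFrom < 0) {
                    text.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
        return text.toString().trim();
    }
}
```

For a document whose body holds only the wrapper div, `visibleText` returns an empty string, so a search index built from it would not pick up terms such as "image1".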
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516459#comment-17516459 ]

Nick Burch commented on TIKA-3711:

I'd lean towards putting the file name as an attribute of the img tag, along with the description as the alt text if the format supports it.

> Image file names included in parsed Word Document text
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Minor
> Attachments: word-doc-with-image.docx
>
> The attached Word document includes nothing but a single image. Running it
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
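If the file name and description were carried as attributes, a consumer that cares about images could read them from the XHTML without them ever reaching the plain text. The sketch below is a hypothetical illustration with JDK SAX classes only; the attribute names `src` and `alt`, the sample value format, and the names `ImageAttributeSketch` and `describeImages` are assumptions, not output confirmed by this thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: reads image file names and alt text from img attributes. */
class ImageAttributeSketch {

    /** Returns one "src -> alt" entry per img element; a missing attribute becomes "". */
    static List<String> describeImages(String xhtml) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("img".equals(qName)) {
                    String src = atts.getValue("src") == null ? "" : atts.getValue("src");
                    String alt = atts.getValue("alt") == null ? "" : atts.getValue("alt");
                    found.add(src + " -> " + alt);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
        return found;
    }
}
```

Because attributes never produce character events, a text-only handler run over the same markup would see nothing for the image.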
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516385#comment-17516385 ]

Luís Filipe Nassif commented on TIKA-3711:

Well, when reading the document in its native format, users will see the embedded images, but of course that's not text. Regarding filenames, "image1", "image2" may not be useful, but "bank transfer receipt", "qrcode for payment" may be very useful...

> I don't think more information for its own sake is necessarily good

It's not, that's why I said this is use case specific... In our project we have had our own content handler to output embedded filenames for a long time, so this change wouldn't affect us. But my point about suppressing the current (intended) output, even if it makes sense, is that it could break other users, so my weak suggestion is to have an option to enable the previous behavior.
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516141#comment-17516141 ]

Sam Stephens commented on TIKA-3711:

I guess the question is what are the semantics of this operation? When I ask for the text of a document, what does that actually mean?

As an end user, I'd argue that the most useful semantics are that getting the text of a document provides the closest possible representation of the text a user would read when reading the document in its native format. By this argument, the image filenames should not be there, because I wouldn't see image filenames if I were reading the Word document from within Word.

I don't think more information for its own sake is necessarily good. If I argue this from a reductio ad absurdum perspective, I'd then say that adding text describing all document formatting is useful: adding the words "Heading 1" each time there's a heading, "Bold" and "Unbold" each time a bolded section occurs. This is clearly more information, but it's also clear that adding this information would rapidly make the text you extracted from a Word document unusable.

From an end user perspective, I'm using this text extraction so I can put documents in a search index. Having the terms "image1", "image2" etc. show up in my index for documents that contain images is not useful behavior, unless that actually occurs in the real text of the document.

The image filenames are metadata. If I wanted that metadata, I can engage with the full XHTML representation of the document to get it. But my take is that BodyContentHandler should give me text, not metadata.
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516029#comment-17516029 ]

Tim Allison commented on TIKA-3711:

Y, I totally hear you [~lfcnassif]. I think we have areas for improvement in configuring the handlers. We currently have no way to specify which contenthandler is selected in tika-config.xml. We have no way to specify configurations for embedded document parsers (aside from the write limits etc.).
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515974#comment-17515974 ]

Luís Filipe Nassif commented on TIKA-3711:

Well, IMHO the user may be interested in checking the image in the original document if he/she knows there is some (non-OCRed) image at a specific text location. This for sure could be done with a specific xml handler impl like you said, but if the current behavior is going to be changed to suppress the current text output, maybe an option to enable it again could help...
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515956#comment-17515956 ]

Tim Allison commented on TIKA-3711:

[~lfcnassif], it feels inelegant to write embedded filenames in the text. Would it not be enough to include that in the xhtml markup?
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515941#comment-17515941 ]

Luís Filipe Nassif commented on TIKA-3711:

I strongly prefer the current behavior, which returns more information to the user. Of course this is use case specific, and maybe an option could exist to disable it.

> Image file names included in parsed Word Document text
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Major
> Attachments: word-doc-with-image.docx
>
> The attached Word document includes nothing but a single image. Running it
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515919#comment-17515919 ]

Tim Allison commented on TIKA-3711:

I introduced that change because some parsers were including it and some were not. So we had different behavior for different file types, which was less than ideal. I included this bullet in the CHANGES.txt file as an alert to changed behavior:

bq. * Improve consistency in reporting package-entry divs across all parsers for embedded files (TIKA-3644). This leads to some more text (embedded file names) in files with many embedded attachments.

We can change the behavior to "include the file name only in xhtml attributes" which will not show up in text. But we should do that consistently for all file types.

Fellow devs, what do you think?