[jira] [Commented] (TIKA-4167) CONTENT_TYPE_USER_OVERRIDE doesn't force content type for application/illustrator files
[ https://issues.apache.org/jira/browse/TIKA-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783431#comment-17783431 ] Sam Stephens commented on TIKA-4167: Thanks Tim. I don't have a use case where I need this behavior. But I had a test for the invariant that when CONTENT_TYPE_USER_OVERRIDE is provided, the "Content-Type" in the returned metadata reflects that, and I was surprised to see that invariant violated. I think this is probably a documentation issue. Reading https://tika.apache.org/2.9.1/detection.html#The_Detector_Interface, I didn't understand that CONTENT_TYPE_USER_OVERRIDE only applies to detection, and that the parser the override selects is then free to choose the content type it thinks is most appropriate. You could consider making this a little clearer in that documentation. For my specific use case, regardless of whether my incoming file has an explicit content type to use for CONTENT_TYPE_USER_OVERRIDE, or whether I'm using auto-detection, I always want to detect PDFs as application/pdf; I have no interest in or use for the subtypes. But I think it's reasonable for that behavior to be something I implement by post-processing the returned Tika metadata (basically, if Content-Type is application/illustrator and dc:format starts with application/pdf, I use application/pdf as the content type). 
> CONTENT_TYPE_USER_OVERRIDE doesn't force content type for > application/illustrator files > --- > > Key: TIKA-4167 > URL: https://issues.apache.org/jira/browse/TIKA-4167 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Sam Stephens >Priority: Minor > > When I parse a file using AutoDetectParser, with Metadata set to > {color:#ce9178}{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE: > "application/pdf"}{color} > and parse [a PDF-like Illustrator > file|[https://github.com/apache/tika/blob/78be82565df4cc3bbc88308be8d686019a10b899/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_AdobeIllustrator.pdf],] > the "Content-Type" in the returned metadata is "application/illustrator", > not "application/pdf". > I think this is happening because "application/illustrator" is a subtype of > "application/pdf". -- This message was sent by Atlassian Jira (v8.20.10#820010)
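The post-processing rule described in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the metadata is modeled as a plain Map<String, String> rather than Tika's Metadata class, the class and method names (ContentTypeNormalizer, normalize) are hypothetical, and the sample dc:format value is made up for the demo.

```java
import java.util.HashMap;
import java.util.Map;

public class ContentTypeNormalizer {

    // Returns application/pdf when the parser reported application/illustrator
    // but dc:format indicates a PDF; otherwise returns Content-Type unchanged.
    // The Map stands in for the metadata returned by the parser (assumption).
    public static String normalize(Map<String, String> metadata) {
        String contentType = metadata.get("Content-Type");
        String format = metadata.get("dc:format");
        if ("application/illustrator".equals(contentType)
                && format != null
                && format.startsWith("application/pdf")) {
            return "application/pdf";
        }
        return contentType;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "application/illustrator");
        // Hypothetical sample value; the real dc:format string may differ.
        metadata.put("dc:format", "application/pdf; version=1.5");
        System.out.println(normalize(metadata)); // prints "application/pdf"
    }
}
```

Keeping the rule outside the parser means the override and auto-detection paths can share the same normalization step.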
[jira] [Created] (TIKA-4167) CONTENT_TYPE_USER_OVERRIDE doesn't force content type for application/illustrator files
Sam Stephens created TIKA-4167:
--

Summary: CONTENT_TYPE_USER_OVERRIDE doesn't force content type for application/illustrator files
Key: TIKA-4167
URL: https://issues.apache.org/jira/browse/TIKA-4167
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.9.1
Reporter: Sam Stephens

When I parse a file using AutoDetectParser, with Metadata set to {color:#ce9178}{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE: "application/pdf"}{color} and parse [a PDF-like Illustrator file|https://github.com/apache/tika/blob/78be82565df4cc3bbc88308be8d686019a10b899/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_AdobeIllustrator.pdf], the "Content-Type" in the returned metadata is "application/illustrator", not "application/pdf".

I think this is happening because "application/illustrator" is a subtype of "application/pdf".

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text
[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551871#comment-17551871 ] Sam Stephens commented on TIKA-3768: Ah, interesting, this is a case of me misunderstanding the product then. This means that in order to actually get all the text possible out of a file, I need to examine both the actual text and the metadata (I'm using this for building a search over documents of many types). The challenge then is that some fields in the metadata object are sourced from text in the document (such as {{dc:subject}} and {{Message-From}}) and should be searchable, and some are not (such as {{Content-Type}} and {{X-TIKA:Parsed-By}}) and should not be searchable. Is there any documentation of the set of possible metadata fields? The constants inherited by [https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] don't appear to be a complete set, as I don't see {{dc:subject}} amongst them. It looks to me like I could strip out fields like {{Content-Type}} as listed in [https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] and any fields with names prefixed by {{X-TIKA:}}, and all remaining fields would be sourced from document text. > message/rfc822 does not include Headers in extracted text > - > > Key: TIKA-3768 > URL: https://issues.apache.org/jira/browse/TIKA-3768 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.4.0 >Reporter: Sam Stephens >Priority: Major > Attachments: email.txt > > > When running AutoDetectParser on message/rfc822 structured text documents, > such as the attached [^email.txt], the extracted text does not include any of > the headers, such as the Subject and From and To lines. > However these lines contain useful text I'd like to be able to extract. I'm > surprised it's not there based on the include everything bias I saw on > https://issues.apache.org/jira/browse/TIKA-3710. 
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a > parser, my debugging appears to show > org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we > get the full text, but the returned content type is 'message/rfc822; > charset=windows-1252'. -- This message was sent by Atlassian Jira (v8.20.7#820007)
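The filtering idea floated in the comment above (drop fields prefixed with X-TIKA: and known non-document fields such as Content-Type, keep the rest as searchable) might look roughly like this. It is a sketch, not a vetted approach: the metadata is modeled as a plain Map<String, String>, the class name SearchableMetadataFilter is hypothetical, and the deny-list is an illustrative assumption rather than a complete set of non-document fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SearchableMetadataFilter {

    // Illustrative deny-list (assumption): field names that describe the file
    // or the transfer rather than text found in the document itself.
    private static final Set<String> NON_DOCUMENT_FIELDS =
            Set.of("Content-Type", "Content-Length", "Content-Encoding");

    // Keeps only the entries that are assumed to come from document text:
    // anything not prefixed with "X-TIKA:" and not on the deny-list.
    public static Map<String, String> searchable(Map<String, String> metadata) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String name = entry.getKey();
            if (name.startsWith("X-TIKA:") || NON_DOCUMENT_FIELDS.contains(name)) {
                continue; // parser bookkeeping or file-level field, skip it
            }
            kept.put(name, entry.getValue());
        }
        return kept;
    }
}
```

Whether every remaining field really originates from document text is exactly the open question in the comment, so a deny-list like this would need checking against the fields each parser actually emits.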
[jira] [Commented] (TIKA-3769) md5 incorrectly detected as application/marc
[ https://issues.apache.org/jira/browse/TIKA-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539138#comment-17539138 ] Sam Stephens commented on TIKA-3769: Thanks for the prompt fix! > md5 incorrectly detected as application/marc > > > Key: TIKA-3769 > URL: https://issues.apache.org/jira/browse/TIKA-3769 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.4.0 >Reporter: Sam Stephens >Priority: Major > Fix For: 2.4.1 > > Attachments: md5.txt > > > When I parse the attached text document using AutoDetectParser, it's > incorrectly detected as application/marc with no text. As other md5s I > generated randomly were correctly detected as text, I'm guessing that the Marc > parser is using some kind of magic bytes to detect Marc files that this file > matches as a false positive. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539054#comment-17539054 ] Sam Stephens commented on TIKA-3710: {quote}The h1 isn't quite as unique as we might like, and maybe not as good as some of the other ones {quote} Honestly, I'm not so worried about the HTML fragment detection, because that's never going to be perfect. A bare text string without any HTML tags is technically an HTML fragment. In the modern world where people can and do define their own HTML tags, you _could_ say that any file that opens with a valid tag as defined by the W3C is HTML, but that feels open to false positives. > HTML document detected incorrect as message/rfc822 > -- > > Key: TIKA-3710 > URL: https://issues.apache.org/jira/browse/TIKA-3710 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 2.3.0 >Reporter: Sam Stephens >Priority: Major > Attachments: html-that-looks-like-rfc822.html > > > I'm detecting content types and extracting text from documents using the > AutoDetectParser. > I've received some documents that are HTML fragments generated from emails. > The documents are clearly HTML, not emails, but the AutoDetectParser gives me > the MIME type message/rfc822 and no text. I've attached an example. > It looks like the presence of From:, Sent:, and Subject: at the beginning of > lines is why the documents are matching RFC822. However, I believe the > presence of HTML before these headers means the document is not valid RFC822. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539051#comment-17539051 ] Sam Stephens commented on TIKA-3710: Is it valid for a message/rfc822 message to have a bunch of preamble like the HTML tags in my document before the headers? Is the answer that the RFC822 detection here is too loose, and the non-header material at the beginning of my file should result in the message/rfc822 parser rejecting it? > HTML document detected incorrect as message/rfc822 > -- > > Key: TIKA-3710 > URL: https://issues.apache.org/jira/browse/TIKA-3710 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 2.3.0 >Reporter: Sam Stephens >Priority: Major > Attachments: html-that-looks-like-rfc822.html > > > I'm detecting content types and extracting text from documents using the > AutoDetectParser. > I've received some documents that are HTML fragments generated from emails. > The documents are clearly HTML, not emails, but the AutoDetectParser gives me > the MIME type message/rfc822 and no text. I've attached an example. > It looks like the presence of From:, Sent:, and Subject: at the beginning of > lines is why the documents are matching RFC822. However, I believe the > presence of HTML before these headers means the document is not valid RFC822. -- This message was sent by Atlassian Jira (v8.20.7#820007)
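One way to picture the stricter check asked about in the comment above is to require that the very first line of the input already looks like a header field (a field name followed by a colon), so that any preamble such as HTML tags causes rejection. The sketch below is a deliberately simplified illustration and is not how Tika's detection works; the class name Rfc822Preamble and its method are hypothetical, and the pattern ignores cases such as mbox "From " lines or whitespace before the colon.

```java
import java.util.regex.Pattern;

public class Rfc822Preamble {

    // Simplified field-name pattern: one or more printable ASCII characters
    // other than space and ':', immediately followed by ':'.
    private static final Pattern HEADER_FIELD = Pattern.compile("[!-9;-~]+:");

    // Returns true only if the first line of the text starts with something
    // shaped like "Name:". Header-like lines further down do not count.
    public static boolean startsWithHeaderField(String text) {
        int newline = text.indexOf('\n');
        String firstLine = newline >= 0 ? text.substring(0, newline) : text;
        return HEADER_FIELD.matcher(firstLine).lookingAt();
    }

    public static void main(String[] args) {
        String email = "From: a@example.com\r\nSubject: hello\r\n\r\nBody";
        String htmlFragment = "<div>\r\nFrom: a@example.com\r\nSubject: hello\r\n</div>";
        System.out.println(startsWithHeaderField(email));        // prints "true"
        System.out.println(startsWithHeaderField(htmlFragment)); // prints "false"
    }
}
```

A check this strict trades false positives for false negatives: a namespaced opening tag such as "<x:foo>" would still pass, and genuine messages with leading junk would be rejected.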
[jira] [Created] (TIKA-3769) md5 incorrectly detected as application/marc
Sam Stephens created TIKA-3769:
--

Summary: md5 incorrectly detected as application/marc
Key: TIKA-3769
URL: https://issues.apache.org/jira/browse/TIKA-3769
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.4.0
Reporter: Sam Stephens
Attachments: md5.txt

When I parse the attached text document using AutoDetectParser, it's incorrectly detected as application/marc with no text. As other md5s I generated randomly were correctly detected as text, I'm guessing that the Marc parser is using some kind of magic bytes to detect Marc files that this file matches as a false positive.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538528#comment-17538528 ] Sam Stephens commented on TIKA-3711: Thanks [~tallison] , confirmed this is working for me as expected. > Image file names included in parsed Word Document text > -- > > Key: TIKA-3711 > URL: https://issues.apache.org/jira/browse/TIKA-3711 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.3.0 >Reporter: Sam Stephens >Priority: Major > Fix For: 2.4.0 > > Attachments: word-doc-with-image-from-word-365.docx, > word-doc-with-image.docx > > > The attached Word document includes nothing but a single image. Running it > through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it > through the Tika 2.3.0 AutoDetectParser returns the text: > {{image1.png}} > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538524#comment-17538524 ] Sam Stephens commented on TIKA-3710: Note that if I exclude org.apache.tika.parser.mail.RFC822Parser as a parser, my debugging appears to show org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we get the full raw text of the document, including HTML tags, but the returned content type is 'message/rfc822; charset=ISO-8859-1'. > HTML document detected incorrect as message/rfc822 > -- > > Key: TIKA-3710 > URL: https://issues.apache.org/jira/browse/TIKA-3710 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 2.3.0 >Reporter: Sam Stephens >Priority: Major > Attachments: html-that-looks-like-rfc822.html > > > I'm detecting content types and extracting text from documents using the > AutoDetectParser. > I've received some documents that are HTML fragments generated from emails. > The documents are clearly HTML, not emails, but the AutoDetectParser gives me > the MIME type message/rfc822 and no text. I've attached an example. > It looks like the presence of From:, Sent:, and Subject: at the beginning of > lines is why the documents are matching RFC822. However, I believe the > presence of HTML before these headers means the document is not valid RFC822. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (TIKA-3768) message/rfc822 does not include Headers in extracted text
[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Stephens updated TIKA-3768: --- Description: When running AutoDetectParser on message/rfc822 structured text documents, such as the attached [^email.txt], the extracted text does not include any of the headers, such as the Subject and From and To lines. However these lines contain useful text I'd like to be able to extract. I'm surprised it's not there based on the include everything bias I saw on https://issues.apache.org/jira/browse/TIKA-3710. Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a parser, my debugging appears to show org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we get the full text, but the returned content type is 'message/rfc822; charset=windows-1252'. was: When running AutoDetectParser on message/rfc822 structured text documents, such as the attached [^email.txt], the extracted text does not include any of the headers, such as the Subject and From and To lines. However these lines contain useful text I'd like to be able to extract. I'm surprised it's not there based on the include everything bias I saw on https://issues.apache.org/jira/browse/TIKA-3710. > message/rfc822 does not include Headers in extracted text > - > > Key: TIKA-3768 > URL: https://issues.apache.org/jira/browse/TIKA-3768 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.4.0 >Reporter: Sam Stephens >Priority: Major > Attachments: email.txt > > > When running AutoDetectParser on message/rfc822 structured text documents, > such as the attached [^email.txt], the extracted text does not include any of > the headers, such as the Subject and From and To lines. > However these lines contain useful text I'd like to be able to extract. I'm > surprised it's not there based on the include everything bias I saw on > https://issues.apache.org/jira/browse/TIKA-3710. 
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a > parser, my debugging appears to show > org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we > get the full text, but the returned content type is 'message/rfc822; > charset=windows-1252'. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (TIKA-3768) message/rfc822 does not include Headers in extracted text
Sam Stephens created TIKA-3768:
--

Summary: message/rfc822 does not include Headers in extracted text
Key: TIKA-3768
URL: https://issues.apache.org/jira/browse/TIKA-3768
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.4.0
Reporter: Sam Stephens
Attachments: email.txt

When running AutoDetectParser on message/rfc822 structured text documents, such as the attached [^email.txt], the extracted text does not include any of the headers, such as the Subject and From and To lines.

However these lines contain useful text I'd like to be able to extract. I'm surprised it's not there based on the include everything bias I saw on https://issues.apache.org/jira/browse/TIKA-3710.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM
[ https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521349#comment-17521349 ] Sam Stephens commented on TIKA-3666: It looks like [~4U6U57] and I have both provided POIFSViewer outputs for you. [~tallison] I hope this is useful for you. Thanks! > Detect and indicate file encrypted with Rights Management Service RMS/IRM > - > > Key: TIKA-3666 > URL: https://issues.apache.org/jira/browse/TIKA-3666 > Project: Tika > Issue Type: Improvement > Components: metadata >Reporter: August Valera >Priority: Major > Attachments: poifsviewer.txt, sam-poifsviewer.txt > > > Rights Management Service (RMS), implemented in MS Office as Information > Rights Management (IRM), allows organizations to set file permissions that > are stored within the file. In most cases, this will result in the file > getting a new extension (with a prefix p, such as {{.txt}} becoming > {{{}.ptxt{}}}), but in the case of MS Office and PDF files, which support > this natively, the implementation results in the file contents being > encrypted without any extension change. > h4. Current behavior > Running such files through Tika produces results as if it was an empty file > ran through {{DefaultParser}} and {{{}OfficeParser{}}}. > h4. Expected behavior > Extract more metadata about necessary permissions to view (if possible), and > throwing {{EncryptedDocumentException}} as is the case with Office files > encrypted in the more traditional manner. > Reference: > [https://docs.microsoft.com/en-us/azure/information-protection/rms-client/clientv2-admin-guide-file-types#supported-file-types-for-classification-and-protection] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM
[ https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Stephens updated TIKA-3666: --- Attachment: sam-poifsviewer.txt > Detect and indicate file encrypted with Rights Management Service RMS/IRM > - > > Key: TIKA-3666 > URL: https://issues.apache.org/jira/browse/TIKA-3666 > Project: Tika > Issue Type: Improvement > Components: metadata >Reporter: August Valera >Priority: Major > Attachments: poifsviewer.txt, sam-poifsviewer.txt > > > Rights Management Service (RMS), implemented in MS Office as Information > Rights Management (IRM), allows organizations to set file permissions that > are stored within the file. In most cases, this will result in the file > getting a new extension (with a prefix p, such as {{.txt}} becoming > {{{}.ptxt{}}}), but in the case of MS Office and PDF files, which support > this natively, the implementation results in the file contents being > encrypted without any extension change. > h4. Current behavior > Running such files through Tika produces results as if it was an empty file > ran through {{DefaultParser}} and {{{}OfficeParser{}}}. > h4. Expected behavior > Extract more metadata about necessary permissions to view (if possible), and > throwing {{EncryptedDocumentException}} as is the case with Office files > encrypted in the more traditional manner. > Reference: > [https://docs.microsoft.com/en-us/azure/information-protection/rms-client/clientv2-admin-guide-file-types#supported-file-types-for-classification-and-protection] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM
[ https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520887#comment-17520887 ] Sam Stephens commented on TIKA-3666: [~4U6U57] did you have any luck sourcing sample files? I also want a fix for this, but unfortunately I have no way to provide a sample document, as I don't have access to the Microsoft products needed to create IRM files (the files I'm having issues with are sourced externally). Thanks for reporting this issue! > Detect and indicate file encrypted with Rights Management Service RMS/IRM > - > > Key: TIKA-3666 > URL: https://issues.apache.org/jira/browse/TIKA-3666 > Project: Tika > Issue Type: Improvement > Components: metadata >Reporter: August Valera >Priority: Major > > Rights Management Service (RMS), implemented in MS Office as Information > Rights Management (IRM), allows organizations to set file permissions that > are stored within the file. In most cases, this will result in the file > getting a new extension (with a prefix p, such as {{.txt}} becoming > {{{}.ptxt{}}}), but in the case of MS Office and PDF files, which support > this natively, the implementation results in the file contents being > encrypted without any extension change. > h4. Current behavior > Running such files through Tika produces results as if it was an empty file > ran through {{DefaultParser}} and {{{}OfficeParser{}}}. > h4. Expected behavior > Extract more metadata about necessary permissions to view (if possible), and > throwing {{EncryptedDocumentException}} as is the case with Office files > encrypted in the more traditional manner. > Reference: > [https://docs.microsoft.com/en-us/azure/information-protection/rms-client/clientv2-admin-guide-file-types#supported-file-types-for-classification-and-protection] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517167#comment-17517167 ] Sam Stephens edited comment on TIKA-3711 at 4/5/22 10:00 PM: - Regarding filenames, I don't think they will ever be semantically meaningful. I just created a document with Word 365 (uploaded as word-doc-with-image-from-word-365.docx), added a picture with the filename test-image.png, and the extracted filename is image.png. I think Word is creating non-interesting filenames. As far as breaking other users, I'm raising this bug because this *is* a change in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where images are not part of the text. Using the ToXMLContentHandler to look at the actual generated HTML, I think that's also going to be surprising behavior for most users. {{}} {{image.png}} {{}} My image actually has alt text; that's not included. And I think including the image file name as a header in the markup is going to be surprising to almost every user. It certainly doesn't match the source document (which has no headers, or visible text of any kind). As an end user, what I'd like is the XHTML to be {{}} And the text from BodyContentHandler to not include the image at all. That way the text is the text, and if I have an interest in Image alt tags, I can operate on the XHTML. If you wanted to include an option to provide text for the image, I don't think image filenames will ever be useful from Word; alt text is the right place semantically to be looking for a textual representation of an image. was (Author: JIRAUSER287416): Regarding filenames, I don't think they will ever be semantically meaningful. I just created a document with Word 365 (uploaded as word-doc-with-image-from-word-365.docx), added a picture with the filename test-image.png, and the extracted filename is image.png. I think Word is creating non-interesting filenames. 
As far as breaking other users, I'm raising this bug because this *is* a change in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where images are not part of the text. Using the ToXMLContentHandler to look at the actual generated HTML, I think that's also going to be surprising behavior for most users. {{}} {{image.png}} {{}} My image actually has alt text; that's not included. And I think including the image file name as a header in the markup is going to be surprising to almost every user. It certainly doesn't match the source document (which has no headers, or visible text of any kind). As an end user, what I'd like is the XHTML to be {{}} {{}} And the text from BodyContentHandler to not include the image at all. That way the text is the text, and if I have an interest in Image alt tags, I can operate on the XHTML. If you wanted to include an option to provide text for the image, I don't think image filenames will ever be useful from Word; alt text is the right place semantically to be looking for a textual representation of an image. > Image file names included in parsed Word Document text > -- > > Key: TIKA-3711 > URL: https://issues.apache.org/jira/browse/TIKA-3711 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 2.3.0 >Reporter: Sam Stephens >Priority: Minor > Attachments: word-doc-with-image-from-word-365.docx, > word-doc-with-image.docx > > > The attached Word document includes nothing but a single image. Running it > through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it > through the Tika 2.3.0 AutoDetectParser returns the text: > {{image1.png}} > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517167#comment-17517167 ] Sam Stephens edited comment on TIKA-3711 at 4/5/22 12:25 AM: - Regarding filenames, I don't think they will ever be semantically meaningful. I just created a document with Word 365 (uploaded as word-doc-with-image-from-word-365.docx), added a picture with the filename test-image.png, and the extracted filename is image.png. I think Word is creating non-interesting filenames. As far as breaking other users, I'm raising this bug because this *is* a change in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where images are not part of the text. Using the ToXMLContentHandler to look at the actual generated HTML, I think that's also going to be surprising behavior for most users. {{}} {{image.png}} {{}} My image actually has alt text; that's not included. And I think including the image file name as a header in the markup is going to be surprising to almost every user. It certainly doesn't match the source document (which has no headers, or visible text of any kind). As an end user, what I'd like is the XHTML to be {{}} {{}} And the text from BodyContentHandler to not include the image at all. That way the text is the text, and if I have an interest in Image alt tags, I can operate on the XHTML. If you wanted to include an option to provide text for the image, I don't think image filenames will ever be useful from Word; alt text is the right place semantically to be looking for a textual representation of an image. was (Author: JIRAUSER287416): Regarding filenames, I don't think they will ever be semantically meaningful. I just created a document with Word 365 (uploaded as word-doc-with-image-from-word-365.docx), added a picture with the filename test-image.png, and the extracted filename is still image1.png. I think Word is creating non-interesting filenames. 
As far as breaking other users, I'm raising this bug because this *is* a change in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where images are not part of the text. Using the ToXMLContentHandler to look at the actual generated HTML, I think that's also going to be surprising behavior for most users. {{}} {{image.png}} {{}} My image actually has alt text; that's not included. And I think including the image file name as a header in the markup is going to be surprising to almost every user. It certainly doesn't match the source document (which has no headers, or visible text of any kind). As an end user, what I'd like is the XHTML to be {{}} {{}} And the text from BodyContentHandler to not include the image at all. That way the text is the text, and if I have an interest in Image alt tags, I can operate on the XHTML. If you wanted to include an option to provide text for the image, I don't think image filenames will ever be useful from Word; alt text is the right place semantically to be looking for a textual representation of an image. > Image file names included in parsed Word Document text > -- > > Key: TIKA-3711 > URL: https://issues.apache.org/jira/browse/TIKA-3711 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 2.3.0 >Reporter: Sam Stephens >Priority: Minor > Attachments: word-doc-with-image-from-word-365.docx, > word-doc-with-image.docx > > > The attached Word document includes nothing but a single image. Running it > through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it > through the Tika 2.3.0 AutoDetectParser returns the text: > {{image1.png}} > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517167#comment-17517167 ] Sam Stephens commented on TIKA-3711: Regarding filenames, I don't think they will ever be semantically meaningful. I just created a document with Word 365 (uploaded as word-doc-with-image-from-word-365.docx), added a picture with the filename test-image.png, and the extracted filename is still image1.png. I think Word is creating non-interesting filenames. As far as breaking other users, I'm raising this bug because this *is* a change in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where images are not part of the text. Using the ToXMLContentHandler to look at the actual generated HTML, I think that's also going to be surprising behavior for most users. {{}} {{image.png}} {{}} My image actually has alt text; that's not included. And I think including the image file name as a header in the markup is going to be surprising to almost every user. It certainly doesn't match the source document (which has no headers, or visible text of any kind). As an end user, what I'd like is the XHTML to be {{}} {{}} And the text from BodyContentHandler to not include the image at all. That way the text is the text, and if I have an interest in Image alt tags, I can operate on the XHTML. If you wanted to include an option to provide text for the image, I don't think image filenames will ever be useful from Word; alt text is the right place semantically to be looking for a textual representation of an image. 
> Image file names included in parsed Word Document text > -- > > Key: TIKA-3711 > URL: https://issues.apache.org/jira/browse/TIKA-3711 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 2.3.0 >Reporter: Sam Stephens >Priority: Minor > Attachments: word-doc-with-image-from-word-365.docx, > word-doc-with-image.docx > > > The attached Word document includes nothing but a single image. Running it > through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it > through the Tika 2.3.0 AutoDetectParser returns the text: > {{image1.png}} > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sam Stephens updated TIKA-3711:
---
Attachment: word-doc-with-image-from-word-365.docx
[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516142#comment-17516142 ]

Sam Stephens commented on TIKA-3710:

The HTML document is exactly what you see there; these documents are fragments, not full HTML documents. However, I did try wrapping the fragment in enclosing tags to make a full document, and it was still detected as message/rfc822.

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
> I'm detecting content types and extracting text from documents using the
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails.
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of
> lines is why the documents are matching RFC822. However, I believe the
> presence of HTML before these headers means the document is not valid RFC822.
[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516141#comment-17516141 ]

Sam Stephens commented on TIKA-3711:

I guess the question is what are the semantics of this operation? When I ask for the text of a document, what does that actually mean? As an end user, I'd argue the most useful semantics are that getting the text of a document provides the closest possible representation of the text a user would read when reading the document in its native format. By this argument, the image filenames should not be there, because I wouldn't see image filenames if I was reading the Word document from within Word.

I don't think more information for its own sake is necessarily good. If I argue this from a reductio ad absurdum perspective, I'd then say that adding text describing all document formatting is useful: adding the words "Heading 1" each time there's a heading, and "Bold" and "Unbold" each time a bolded section occurs. This is clearly more information, but it's also clear that adding this information would rapidly make the text you extracted from a Word document unusable.

From an end user perspective, I'm using this text extraction so I can put documents in a search index. Having the terms "image1", "image2", etc. show up in my index for documents that contain images is not useful behavior, unless that actually occurs in the real text of the document.

The image filenames are metadata. If I wanted that metadata, I could engage with the full XHTML representation of the document to get it. But my take is that BodyContentHandler should give me text, not metadata.
> Image file names included in parsed Word Document text
> --
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Minor
> Attachments: word-doc-with-image.docx
>
> The attached Word document includes nothing but a single image. Running it
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
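For the search-index use case described in the comment above, one stopgap is to post-process the extracted text before indexing. The following is a minimal sketch in plain Java, assuming the unwanted names follow Word's generated `image1.png` pattern and each sits on a line of its own. The class name, method name, and list of extensions are hypothetical and not part of Tika.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical post-processing helper; not part of Tika. */
final class ImageNameFilter {

    // Assumption: generated names look like image1.png, image2.jpeg, and so on.
    private static final Pattern GENERATED_IMAGE_NAME = Pattern.compile(
            "image\\d+\\.(png|jpe?g|gif|bmp|emf|wmf|tiff?)", Pattern.CASE_INSENSITIVE);

    /** Drops lines that consist solely of a generated image file name. */
    static String stripGeneratedImageNames(String extractedText) {
        return extractedText.lines()
                // matches() requires the whole trimmed line to be a file name,
                // so names mentioned inside real sentences are kept.
                .filter(line -> !GENERATED_IMAGE_NAME.matcher(line.trim()).matches())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String text = "Quarterly report\nimage1.png\nSee image1.png for details";
        System.out.println(stripGeneratedImageNames(text));
    }
}
```

The filter only removes whole-line matches, so a document that genuinely mentions "image1.png" in its body text still has that term indexed. This is the distinction the comment draws between real text and metadata.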
[jira] [Updated] (TIKA-3711) Image file names included in parsed Word Document text
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sam Stephens updated TIKA-3711:
---
Description:
The attached Word document includes nothing but a single image. Running it through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it through the Tika 2.3.0 AutoDetectParser returns the text:
{{image1.png}}

was:
The attached Word document includes nothing but a single image. Running it through the Tika 2.2.0 AutoDetectParser correctly returns no text. Running it through the Tika 2.3.0 AutoDetectParser returns the text:
{{image1.png}}

> Image file names included in parsed Word Document text
> --
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Major
> Attachments: word-doc-with-image.docx
>
> The attached Word document includes nothing but a single image. Running it
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
[jira] [Created] (TIKA-3711) Image file names included in parsed Word Document text
Sam Stephens created TIKA-3711:
--
Summary: Image file names included in parsed Word Document text
Key: TIKA-3711
URL: https://issues.apache.org/jira/browse/TIKA-3711
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.3.0
Reporter: Sam Stephens
Attachments: word-doc-with-image.docx

The attached Word document includes nothing but a single image. Running it through the Tika 2.2.0 AutoDetectParser correctly returns no text. Running it through the Tika 2.3.0 AutoDetectParser returns the text:
{{image1.png}}
[jira] [Created] (TIKA-3710) HTML document detected incorrect as message/rfc822
Sam Stephens created TIKA-3710:
--
Summary: HTML document detected incorrect as message/rfc822
Key: TIKA-3710
URL: https://issues.apache.org/jira/browse/TIKA-3710
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 2.3.0
Reporter: Sam Stephens
Attachments: html-that-looks-like-rfc822.html

I'm detecting content types and extracting text from documents using the AutoDetectParser.

I've received some documents that are HTML fragments generated from emails. The documents are clearly HTML, not emails, but the AutoDetectParser gives me the MIME type message/rfc822 and no text. I've attached an example.

It looks like the presence of From:, Sent:, and Subject: at the beginning of lines is why the documents are matching RFC822. However, I believe the presence of HTML before these headers means the document is not valid RFC822.
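The argument in the report, that markup appearing before the header-like lines should rule out message/rfc822, can be sketched as a small pre-check in plain Java. This is only an illustration of the reasoning, not how Tika's detector works. The class and method names are hypothetical, and the header names are simply the three mentioned in the report.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of the report's argument; not Tika's detection logic. */
final class MarkupBeforeHeadersCheck {

    // The three header names called out in the report, anchored at the start of a line.
    private static final Pattern HEADER_LINE = Pattern.compile("^(From|Sent|Subject):");

    /**
     * Returns true if a line starting with '<' appears before the first
     * From:/Sent:/Subject: line, i.e. markup precedes the header-like lines.
     */
    static boolean markupPrecedesHeaders(String content) {
        for (String line : content.split("\\R")) {
            if (line.strip().startsWith("<")) {
                return true;   // markup seen first
            }
            if (HEADER_LINE.matcher(line).lookingAt()) {
                return false;  // header-like line seen first
            }
        }
        return false;          // neither markup nor header-like lines found
    }

    public static void main(String[] args) {
        String fragment = "<div>\nFrom: someone@example.com\nSent: Monday\nSubject: Hello\n</div>";
        System.out.println(markupPrecedesHeaders(fragment));
    }
}
```

A caller could run a check like this before trusting a message/rfc822 result and fall back to treating the content as HTML when it returns true. A line-based test is deliberately crude: it would not catch markup that follows leading plain text on the same line.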