[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text
[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550216#comment-17550216 ]

Nick Burch commented on TIKA-3768:
----------------------------------

I wouldn't expect to find those in the textual content after parsing; those fields should end up in the Metadata object instead.

We have a bunch of unit tests for mail parsing which show that, for our test files at least, subject + from + to all come through, see [https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java]

Are you able to compare your code with that in the unit test, and see any differences between the working test and yours? Bonus marks if you can write a small failing JUnit test that shows the issue with your file.

> message/rfc822 does not include Headers in extracted text
> ---------------------------------------------------------
>
>                 Key: TIKA-3768
>                 URL: https://issues.apache.org/jira/browse/TIKA-3768
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents,
> such as the attached [^email.txt], the extracted text does not include any of
> the headers, such as the Subject and From and To lines.
> However these lines contain useful text I'd like to be able to extract. I'm
> surprised it's not there based on the include everything bias I saw on
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a
> parser, my debugging appears to show
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we
> get the full text, but the returned content type is 'message/rfc822;
> charset=windows-1252'.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text
[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551871#comment-17551871 ]

Sam Stephens commented on TIKA-3768:
------------------------------------

Ah, interesting, this is a case of me misunderstanding the product then.

This means that in order to actually get all the text possible out of a file, I need to examine both the actual text and the metadata (I'm using this for building a search over documents of many types).

The challenge then is that some fields in the metadata object are sourced from text in the document (such as {{dc:subject}} and {{Message-From}}) and should be searchable, while others are not (such as {{Content-Type}} and {{X-TIKA:Parsed-By}}) and should not be searchable.

Is there any documentation of the set of possible metadata fields? The constants inherited by [https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] don't appear to be a complete set, as I don't see {{dc:subject}} amongst them.

It looks to me like I could strip out fields like {{Content-Type}}, as listed in [https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html], and any fields with names prefixed by {{X-TIKA:}}, and all remaining fields would be sourced from document text.
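
The filtering idea in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain `Map<String, List<String>>` stands in for Tika's Metadata object so the example runs without Tika on the classpath, and the class `SearchableText`, its deny-list, and its method names are hypothetical. Whether every field that survives the filter is really sourced from document text is the open question in this thread, not something the sketch settles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Sketch only. A plain Map stands in for Tika's Metadata object;
 * the class, method names, and deny-list are hypothetical.
 */
final class SearchableText {

    // Assumption: fields named here describe the container, not the
    // document text. Only Content-Type is named in the thread; extend
    // the set as needed.
    private static final Set<String> EXCLUDED_NAMES = Set.of("Content-Type");

    // Assumption: everything under this prefix is parser bookkeeping.
    private static final String EXCLUDED_PREFIX = "X-TIKA:";

    /** Returns true if the field passes the deny-list and prefix check. */
    static boolean isSearchable(String fieldName) {
        return !fieldName.startsWith(EXCLUDED_PREFIX)
                && !EXCLUDED_NAMES.contains(fieldName);
    }

    /** Joins the kept metadata values and the body text, one per line. */
    static String build(String bodyText, Map<String, List<String>> metadata) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            if (isSearchable(entry.getKey())) {
                parts.addAll(entry.getValue());
            }
        }
        parts.add(bodyText);
        return String.join("\n", parts);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = Map.of(
                "dc:subject", List.of("Quarterly report"),
                "Content-Type", List.of("message/rfc822"));
        System.out.println(build("Body text", metadata));
    }
}
```

A deny-list keeps unknown fields searchable by default, which matches the "include everything" bias mentioned in the issue; an allow-list would be the safer choice if indexing parser bookkeeping by accident is the bigger concern.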
[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text
[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552078#comment-17552078 ]

Nick Burch commented on TIKA-3768:
----------------------------------

If we can put something into a properly typed + structured metadata field, we will!

The full list of metadata property definitions is spread across the interfaces in [https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/package-summary.html], grouped by type. Wherever possible we re-use existing well-known definitions.

While we always store the metadata values as strings, the property definitions will help you turn them back into the underlying Java types, e.g. get the date back as a Java Date.
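
To make the "stored as strings, read back as Java types" point concrete, here is a minimal standard-library sketch of turning a stored date string back into a `java.util.Date`. It assumes the stored value is an ISO 8601 UTC timestamp such as `2022-06-08T10:15:30Z`; the class `MetadataDates` and the method `parseIsoDate` are hypothetical and are not Tika API. In Tika itself the typed property definitions are the intended route, so treat this only as an illustration of the conversion.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Date;
import java.util.Optional;

/**
 * Sketch only: converts a string-valued metadata entry back into a
 * java.util.Date. Hypothetical helper, not part of Tika.
 */
final class MetadataDates {

    /**
     * Assumption: the value is an ISO 8601 instant in UTC
     * (for example "2022-06-08T10:15:30Z"). Anything else, including
     * null, yields an empty Optional instead of an exception.
     */
    static Optional<Date> parseIsoDate(String value) {
        if (value == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Date.from(Instant.parse(value.trim())));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIsoDate("2022-06-08T10:15:30Z").isPresent());
        System.out.println(parseIsoDate("not a date").isPresent());
    }
}
```

Returning an `Optional` rather than throwing keeps an indexing loop simple when a document carries a malformed or missing date.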