[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551871#comment-17551871 ]
Sam Stephens commented on TIKA-3768:
------------------------------------

Ah, interesting, this is a case of me misunderstanding the product then.

This means that in order to actually get all the text possible out of a file, I need to examine both the actual text and the metadata (I'm using this for building a search over documents of many types).

The challenge then is that some fields in the metadata object are sourced from text in the document (such as {{dc:subject}} and {{Message-From}}) and should be searchable, while others are not (such as {{Content-Type}} and {{X-TIKA:Parsed-By}}) and should not be searchable.

Is there any documentation of the set of possible metadata fields? The constants inherited by [https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] don't appear to be a complete set, as I don't see {{dc:subject}} amongst them.

It looks to me like I could strip out fields like {{Content-Type}} as listed in [https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] and any fields with names prefixed by {{X-TIKA:}}, and all remaining fields would be sourced from document text.

> message/rfc822 does not include Headers in extracted text
> ---------------------------------------------------------
>
>                 Key: TIKA-3768
>                 URL: https://issues.apache.org/jira/browse/TIKA-3768
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents,
> such as the attached [^email.txt], the extracted text does not include any of
> the headers, such as the Subject and From and To lines.
> However these lines contain useful text I'd like to be able to extract. I'm
> surprised it's not there based on the include everything bias I saw on
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a
> parser, my debugging appears to show
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we
> get the full text, but the returned content type is 'message/rfc822;
> charset=windows-1252'.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
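
The filtering idea from the comment above (drop {{Content-Type}} and anything prefixed with {{X-TIKA:}}, keep the rest) could be sketched roughly as follows. This is only an illustration under assumptions: the metadata is modelled as a plain Map rather than a Tika Metadata object, the class and method names are hypothetical, and the exclusion list contains only the one field named in the comment. Whether every remaining field really comes from document text is exactly the open question, so the result should be treated as a starting point.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika.
final class SearchableMetadataFilter {

    // Assumption: field names treated as technical rather than document text.
    // Only Content-Type is named in the comment; extend this set as needed.
    static final Set<String> EXCLUDED_NAMES = Set.of("Content-Type");

    // Prefix of Tika's own bookkeeping fields, e.g. X-TIKA:Parsed-By.
    static final String TIKA_PREFIX = "X-TIKA:";

    // Returns a copy of the metadata without excluded names or X-TIKA: fields.
    static Map<String, List<String>> searchableFields(Map<String, List<String>> metadata) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            String name = entry.getKey();
            if (name.startsWith(TIKA_PREFIX) || EXCLUDED_NAMES.contains(name)) {
                continue; // technical field, skip it
            }
            kept.put(name, entry.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample values are made up for the demonstration.
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("dc:subject", List.of("Quarterly report"));
        metadata.put("Message-From", List.of("someone@example.com"));
        metadata.put("Content-Type", List.of("message/rfc822"));
        metadata.put("X-TIKA:Parsed-By", List.of("org.apache.tika.parser.mail.RFC822Parser"));

        System.out.println(searchableFields(metadata).keySet());
    }
}
```

With a real Tika Metadata object, the same loop could iterate over {{names()}} and read each field with {{getValues(name)}} instead of walking a Map.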