[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551871#comment-17551871
 ] 

Sam Stephens commented on TIKA-3768:
------------------------------------

Ah, interesting, this is a case of me misunderstanding the product then.

This means that in order to actually get all the text possible out of a file, I 
need to examine both the actual text and the metadata (I'm using this for 
building a search over documents of many types).

The challenge then is that some fields in the metadata object are sourced from 
text in the document (such as {{dc:subject}} and {{Message-From}}) and should 
be searchable, while others are not (such as {{Content-Type}} and 
{{X-TIKA:Parsed-By}}) and should not be searchable.

Is there any documentation of the set of possible metadata fields? The 
constants inherited by 
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] 
don't appear to be a complete set, as I don't see {{dc:subject}} amongst them.

It looks to me like I could strip out fields like {{Content-Type}} as listed in 
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] and 
any fields with names prefixed by {{X-TIKA:}}, and all remaining fields 
would be sourced from document text.
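
A minimal sketch of that filtering in Java, to make the idea concrete. It 
assumes the metadata has already been copied into a plain map (for example by 
looping over {{Metadata.names()}} and {{Metadata.getValues(name)}}). The class 
name, the method names and the contents of the denylist are hypothetical; only 
{{Content-Type}} and the {{X-TIKA:}} prefix come from the discussion above, 
and a real denylist would probably need more entries.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika: keeps only the metadata fields
// that are assumed to be sourced from document text.
final class SearchableMetadataFilter {

    // Assumption: fields with this prefix are Tika's own bookkeeping.
    static final String TIKA_PREFIX = "X-TIKA:";

    // Assumption: hand-maintained denylist of technical fields.
    // Only Content-Type is taken from the discussion; extend as needed.
    static final Set<String> TECHNICAL_FIELDS = Set.of("Content-Type");

    // A field is searchable unless it is Tika bookkeeping or denylisted.
    static boolean isSearchable(String name) {
        return !name.startsWith(TIKA_PREFIX) && !TECHNICAL_FIELDS.contains(name);
    }

    // Returns a copy of the input containing only searchable fields,
    // preserving the iteration order of the input map.
    static Map<String, List<String>> searchableFields(Map<String, List<String>> metadata) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            if (isSearchable(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

With this approach, {{dc:subject}} and {{Message-From}} would be kept, while 
{{Content-Type}} and {{X-TIKA:Parsed-By}} would be dropped. Whether every 
remaining field really comes from document text is exactly the open question.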

> message/rfc822 does not include Headers in extracted text
> ---------------------------------------------------------
>
>                 Key: TIKA-3768
>                 URL: https://issues.apache.org/jira/browse/TIKA-3768
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents, 
> such as the attached [^email.txt], the extracted text does not include any of 
> the headers, such as the Subject and From and To lines.
> However these lines contain useful text I'd like to be able to extract. I'm 
> surprised it's not there based on the include everything bias I saw on 
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
> parser, my debugging appears to show 
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we 
> get the full text, but the returned content type is 'message/rfc822; 
> charset=windows-1252'.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
