[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Stephens updated TIKA-3768:
-------------------------------
    Description: 
When running AutoDetectParser on message/rfc822 structured text documents, such 
as the attached [^email.txt], the extracted text does not include any of the 
headers, such as the Subject, From, and To lines.

However, these lines contain useful text that I'd like to be able to extract. 
I'm surprised they are missing, given the include-everything bias I saw on 
https://issues.apache.org/jira/browse/TIKA-3710.

Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
parser, my debugging suggests that 
org.apache.tika.parser.csv.TextAndCSVParser is used for parsing instead. We 
then get the full text, but the returned content type is 'message/rfc822; 
charset=windows-1252'.
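
To illustrate what "including the headers" could mean in practice, here is a 
minimal sketch that reads the header block straight from the raw message so it 
could be prepended to whatever text the parser returns. It uses only the Java 
standard library and is not Tika code: the class Rfc822HeaderSketch and the 
methods readHeaders and headersAsText are hypothetical names made up for this 
example, and the sample message is invented. It assumes the usual RFC 822 
layout, where headers end at the first empty line and continuation lines start 
with a space or tab.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Apache Tika.
final class Rfc822HeaderSketch {

    // Collects the header block (everything before the first empty line).
    // Folded continuation lines are appended to the preceding header, and
    // repeated header names are joined with ", ".
    static Map<String, String> readHeaders(String rawMessage) {
        Map<String, String> headers = new LinkedHashMap<>();
        String currentName = null;
        for (String line : rawMessage.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // an empty line separates headers from the body
            }
            boolean folded = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (folded) {
                if (currentName != null) {
                    headers.merge(currentName, line.trim(),
                            (a, b) -> (a + " " + b).trim());
                }
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                currentName = null; // not a "Name: value" line, skip it
                continue;
            }
            currentName = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            headers.merge(currentName, value, (a, b) -> a + ", " + b);
        }
        return headers;
    }

    // Renders the headers as plain "Name: value" lines, ready to be
    // placed in front of the extracted body text.
    static String headersAsText(Map<String, String> headers) {
        StringBuilder sb = new StringBuilder();
        headers.forEach((name, value) ->
                sb.append(name).append(": ").append(value).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented sample message, not the attached email.txt.
        String raw = "From: alice@example.com\r\n"
                + "To: bob@example.com\r\n"
                + "Subject: Quarterly\r\n"
                + " report\r\n"
                + "\r\n"
                + "Body text goes here.\r\n";
        System.out.print(headersAsText(readHeaders(raw)));
    }
}
```

This is only a workaround sketch under the assumptions above. It does not 
decode encoded words or handle MIME structure, which a real mail parser would 
need to do.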

  was:
When running AutoDetectParser on message/rfc822 structured text documents, such 
as the attached [^email.txt], the extracted text does not include any of the 
headers, such as the Subject and From and To lines.

However these lines contain useful text I'd like to be able to extract. I'm 
surprised it's not there based on the include everything bias I saw on 
https://issues.apache.org/jira/browse/TIKA-3710.


> message/rfc822 does not include Headers in extracted text
> ---------------------------------------------------------
>
>                 Key: TIKA-3768
>                 URL: https://issues.apache.org/jira/browse/TIKA-3768
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: email.txt



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
