[ 
https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280772#comment-17280772
 ] 

Vamsi Molli commented on TIKA-3290:
-----------------------------------

<mime-type type="message/rfc822">
  <magic priority="50">
    <!-- these should be 100% hits...if you see this at offset=0 -->
    <match value="Relay-Version:" type="stringignorecase" offset="0"/>
    <match value="#!\ rnews" type="string" offset="0"/>
    <match value="N#!\ rnews" type="string" offset="0"/>
    <match value="Forward\ to" type="string" offset="0"/>
    <match value="Pipe\ to" type="string" offset="0"/>
    <match value="Return-Path:" type="stringignorecase" offset="0"/>
    <match value="Message-ID:" type="stringignorecase" offset="0"/>
    <match value="X-Mailer:" type="string" offset="0"/>
    <match value="X-Notes-Item:" type="string" offset="0">
      <match value="Message-ID:" type="string" offset="0:8192"/>
    </match>
    <!-- be a bit more flexible, but require two of these -->
    <match minShouldMatch="2">
      <match value="Date:" type="stringignorecase" offset="0"/>
      <match value="Delivered-To:" type="string" offset="0"/>
      <match value="From:" type="stringignorecase" offset="0"/>
      <match value="Message-ID:" type="stringignorecase" offset="0"/>
      <match value="MIME-Version:" type="stringignorecase" offset="0"/>
      <match value="Received:" type="stringignorecase" offset="0"/>
      <match value="Relay-Version:" type="stringignorecase" offset="0"/>
      <match value="Return-Path:" type="stringignorecase" offset="0"/>
      <match value="Sent:" type="string" offset="0"/>
      <match value="Status:" type="string" offset="0"/>
      <match value="User-Agent:" type="string" offset="0"/>
      <match value="X-Mailer:" type="string" offset="0"/>
      <match value="X-Originating-IP:" type="stringignorecase" offset="0"/>
      <match value="\nDate:" type="stringignorecase" offset="0:1024"/>
      <match value="\nDelivered-To:" type="string" offset="0:1024"/>
      <match value="\nFrom:" type="stringignorecase" offset="0:1024"/>
      <match value="\nMIME-Version:" type="stringignorecase" offset="0:1024"/>
      <match value="\nReceived:" type="stringignorecase" offset="0:1024"/>
      <match value="\nRelay-Version:" type="stringignorecase" offset="0:1024"/>
      <match value="\nReturn-Path:" type="stringignorecase" offset="0:1024"/>
      <match value="\nSent:" type="string" offset="0:1024"/>
      <match value="\nStatus:" type="string" offset="0:1024"/>
      <match value="\nSubject:" type="string" offset="0:1024"/>
      <match value="\nTo:" type="string" offset="0:1024"/>
      <match value="\nUser-Agent:" type="string" offset="0:1024"/>
      <match value="\nX-Mailer:" type="string" offset="0:1024"/>
      <match value="\nX-Originating-IP:" type="stringignorecase" offset="0:1024"/>
    </match>
    <!-- match X- DKIM- ARC- at start of file and then require at least one
         of the usual: from, received, date...but look farther into the file
         because of the X|DKIM|ARC headers-->
    <match value="(X|DKIM|ARC)-" type="regex" offset="0">
      <match value="\nDate:" type="stringignorecase" offset="0:8192"/>
      <match value="\nDelivered-To:" type="string" offset="0:8192"/>
      <match value="\nFrom:" type="stringignorecase" offset="0:8192"/>
      <match value="\nMessage-ID:" type="stringignorecase" offset="0:8192"/>
      <match value="\nMIME-Version:" type="stringignorecase" offset="0:8192"/>
      <match value="\nReceived:" type="stringignorecase" offset="0:8192"/>
      <match value="\nRelay-Version:" type="stringignorecase" offset="0:8192"/>
      <match value="\nReturn-Path:" type="stringignorecase" offset="0:8192"/>
      <match value="\nStatus:" type="string" offset="0:8192"/>
      <match value="\nUser-Agent:" type="string" offset="0:8192"/>
      <match value="\nX-Mailer:" type="string" offset="0:8192"/>
      <match value="\nX-Originating-IP:" type="stringignorecase" offset="0:8192"/>
    </match>
  </magic>
  <magic priority="40">
    <!-- lower priority than message/news -->
    <match value="\nMessage-ID:" type="stringignorecase" offset="0:1000"/>
  </magic>
  <glob pattern="*.eml"/>
  <glob pattern="*.mime"/>
  <sub-class-of type="text/x-tika-text-based-message"/>
</mime-type>

It is hitting on rfc822 because of the first section: the data contains From 
and Sent, which are among the text values that Tika uses for detection.
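
To make the "require two of these" rule concrete, here is a rough Java sketch of how a minShouldMatch="2" group could be evaluated against the start of a file. It is only an illustration under my own assumptions, not Tika's actual implementation: the class MinShouldMatchSketch, its Clause helper, and the sample text are all hypothetical, it models only a subset of the clauses, and it works on a String rather than on raw bytes.

```java
final class MinShouldMatchSketch {

    /** One simplified match clause: a literal, a case flag, and a window of allowed start offsets. */
    static final class Clause {
        final String value;
        final boolean ignoreCase;
        final int fromOffset;
        final int toOffset;

        Clause(String value, boolean ignoreCase, int fromOffset, int toOffset) {
            this.value = value;
            this.ignoreCase = ignoreCase;
            this.fromOffset = fromOffset;
            this.toOffset = toOffset;
        }

        /** True if the literal starts at some offset inside [fromOffset, toOffset]. */
        boolean matches(String text) {
            int lastStart = Math.min(toOffset, text.length() - value.length());
            for (int start = fromOffset; start <= lastStart; start++) {
                if (text.regionMatches(ignoreCase, start, value, 0, value.length())) {
                    return true;
                }
            }
            return false;
        }
    }

    // Hypothetical subset of the clauses in the minShouldMatch="2" group.
    static final Clause[] CLAUSES = {
        new Clause("Date:", true, 0, 0),
        new Clause("From:", true, 0, 0),
        new Clause("Sent:", false, 0, 0),
        new Clause("\nDate:", true, 0, 1024),
        new Clause("\nFrom:", true, 0, 1024),
        new Clause("\nSent:", false, 0, 1024),
        new Clause("\nSubject:", false, 0, 1024),
        new Clause("\nTo:", false, 0, 1024),
    };

    /** Counts how many clauses of the group match the text. */
    static int countHits(String text) {
        int hits = 0;
        for (Clause clause : CLAUSES) {
            if (clause.matches(text)) {
                hits++;
            }
        }
        return hits;
    }

    /** The group is satisfied once at least two clauses match. */
    static boolean looksLikeRfc822(String text) {
        return countHits(text) >= 2;
    }
}
```

With this sketch, a plain note that happens to begin with "From: Alice" followed by a line "Sent: Monday" already satisfies the group (one hit for "From:" at offset 0, one for "\nSent:" in the first 1024 characters), while a text with only a single "From:" line further down does not.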



It looks like Tika has an issue reading the file from a pure stream, which is 
why it detects the data as an octet-stream.  I see it is indeed rfc822 ("eml").

> Extension reading it as eml instead of txt
> ------------------------------------------
>
>                 Key: TIKA-3290
>                 URL: https://issues.apache.org/jira/browse/TIKA-3290
>             Project: Tika
>          Issue Type: Bug
>          Components: core, mime
>    Affects Versions: 1.25
>            Reporter: Vamsi Molli
>            Priority: Major
>              Labels: tika-parsers
>             Fix For: 1.24.1
>
>         Attachments: test_sample_message.txt
>
>
> The attached file is being detected as eml instead of txt. With version 
> 1.24.1 it was detected as txt, but since the upgrade to 1.25 it is 
> detected as eml. As a result, we get a mail corrupted error while parsing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
