Andreas Meier created TIKA-2578:
-----------------------------------

             Summary: Mails not recognized when unknown X-headers are present
                 Key: TIKA-2578
                 URL: https://issues.apache.org/jira/browse/TIKA-2578
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 1.17, 1.18, 2.0.0
            Reporter: Andreas Meier
         Attachments: testRFC822_with_leading_x_header

I found some mails that begin with X-headers. Tika detects these mails as text/plain rather than as mail.
One example is Cisco's IronPort, which may add "X-IronPort-AV" at the beginning of a mail.

I would therefore like to discuss whether and how Tika should handle these cases.

In my opinion Tika should try to detect files with leading X-headers and preprocess them to obtain a valid mail.

Suggestion:
{code:xml}
<mime-type type="text/x-tika-x-header">
  <magic priority="50">
    <match value="X-" type="string" offset="0">
      <match value="Message-ID:" type="string" offset="0:8192"/>
      <match value="From:" type="stringignorecase" offset="0:8192"/>
      <match value="To:" type="stringignorecase" offset="0:8192"/>
      <match value="Subject:" type="string" offset="0:8192"/>
      <match value="MIME-Version:" type="stringignorecase" offset="0:8192"/>
    </match>
  </magic>
  <sub-class-of type="text/x-tika-text-based-message"/>
</mime-type>
{code}

See also: [RFC6648|https://tools.ietf.org/html/rfc6648]

Attached is an example file.

Regards
Andreas

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
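
For discussion, the intent of the suggested magic rule can be sketched in plain Java. This is only an illustration of the matching logic as I read it, not Tika's detector code: the class name {{XHeaderMagicSketch}} and the method {{looksLikeXHeaderMail}} are hypothetical, and the handling of the "0:8192" range (a header name may start at any offset from 0 to 8192) is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of the proposed rule; not part of Tika.
final class XHeaderMagicSketch {
    // Window taken from offset="0:8192" in the proposed rule (assumed inclusive).
    private static final int WINDOW = 8192;

    // Header names from the proposal, in the same order as the XML.
    private static final String[] NAMES = {
        "Message-ID:", "From:", "To:", "Subject:", "MIME-Version:"
    };
    // true where the proposal uses type="stringignorecase".
    private static final boolean[] IGNORE_CASE = {false, true, true, false, true};

    static boolean looksLikeXHeaderMail(byte[] data) {
        // Outer match: the literal "X-" at offset 0.
        if (data.length < 2 || data[0] != 'X' || data[1] != '-') {
            return false;
        }
        // Decode just enough bytes so that a name starting at offset 8192 still fits.
        int longest = "MIME-Version:".length();
        int len = Math.min(data.length, WINDOW + longest);
        // ISO-8859-1 maps every byte to exactly one char, so indexes equal byte offsets.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        String lower = head.toLowerCase(Locale.ROOT);

        // Inner matches: any one of the known header names inside the window is enough.
        for (int i = 0; i < NAMES.length; i++) {
            int idx = IGNORE_CASE[i]
                ? lower.indexOf(NAMES[i].toLowerCase(Locale.ROOT))
                : head.indexOf(NAMES[i]);
            if (idx >= 0 && idx <= WINDOW) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch the header names are matched as plain substrings, so "To:" also matches inside "Reply-To:". Whether that is acceptable for the real rule is part of what I would like to discuss.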