[ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004161#comment-13004161
 ] 

Sjoerd Smeets commented on TIKA-461:
------------------------------------

I've added a patch that extracts some more metadata fields in the 
MailContentHandler. It extracts:
- MESSAGE_FROM
- MESSAGE_TO
- MESSAGE_CC
- MESSAGE_BCC
- CREATION_DATE

The AUTHOR metadata is duplicated in the MESSAGE_FROM field, as it has the same 
meaning as MESSAGE_FROM. Perhaps AUTHOR is superfluous? Furthermore, I found 
that some mails from the Enron data set do not contain email addresses in 
their address lists. mime4j marks these fields as invalid, but they still 
contain useful information. I've therefore added an additional metadata field 
extractor for the case where mime4j decides a field is invalid. Some tests and 
the Enron emails in question are included.
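
To illustrate the fallback idea, here is a minimal sketch. It is not the code 
from the attached patch. It assumes the raw header body is available as a 
plain string and that entries are separated by commas. The class name 
LenientAddressList and the method extract are hypothetical, and the sample 
input is made up rather than taken from the Enron data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a lenient fallback for address-list headers that a
// strict parser has rejected, e.g. because they hold names but no addresses.
public class LenientAddressList {

    // Splits the raw header body on commas and keeps every non-empty,
    // trimmed entry. Assumption: entries are comma-separated and contain
    // no quoted commas.
    public static List<String> extract(String rawFieldBody) {
        List<String> entries = new ArrayList<>();
        if (rawFieldBody == null) {
            return entries;
        }
        for (String part : rawFieldBody.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up header body containing display names only.
        System.out.println(extract("Alice Example, Bob Example"));
    }
}
```

A plain comma split would mis-handle quoted display names that themselves 
contain commas, so a fallback like this would only be worth using after the 
strict parse has already failed.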


> RFC822 messages not parsed
> --------------------------
>
>                 Key: TIKA-461
>                 URL: https://issues.apache.org/jira/browse/TIKA-461
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Joshua Turner
>            Assignee: Julien Nioche
>         Attachments: TIKA-461-config.patch, TIKA-461-parse.patch, 
> TIKA-461-plus-tests-1.patch, TIKA-461.patch, testRFC822-multipart
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
