[ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915285#action_12915285
 ] 

Nick Burch commented on TIKA-461:
---------------------------------

It'd probably be good to see some more tests with it. For now, just checking 
your basic message should be fine, but I'd suggest we also try to get an email 
with plain text, html, images and similar in to check the more complex bits.

In terms of the nested parser, I'm tempted to say we do something so that plain 
text comes out without any extra work needed. Anything else gets handled via a 
Parser fetched from the ParseContext if required, much as we're doing for 
container formats like zip, .docx etc. That way, you can throw a simple email 
at it and get the text, but the rest of the parts are available if you want them.
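
A rough sketch of that idea, in case it helps the discussion. This is not the 
patch and does not use Tika's classes: PartParser, Context and Part are 
hypothetical stand-ins (Context only imitates the "look up a delegate by type" 
role that ParseContext plays). Plain text parts are emitted directly, and 
anything else is handed to a delegate only if the caller supplied one.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NestedPartsSketch {

    /** Hypothetical delegate for non-text parts; not Tika's Parser interface. */
    interface PartParser {
        String parse(byte[] body);
    }

    /** Hypothetical context that hands out objects by type. */
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast returns null when nothing was registered for the key.
            return key.cast(entries.get(key));
        }
    }

    /** One message part: a content type plus its raw bytes (assumed UTF-8 here). */
    static final class Part {
        final String contentType;
        final byte[] body;

        Part(String contentType, byte[] body) {
            this.contentType = contentType;
            this.body = body;
        }
    }

    /** Emits plain text directly; other parts go to the delegate, if any. */
    static String extract(List<Part> parts, Context context) {
        StringBuilder text = new StringBuilder();
        PartParser delegate = context.get(PartParser.class);
        for (Part part : parts) {
            if (part.contentType.startsWith("text/plain")) {
                text.append(new String(part.body, StandardCharsets.UTF_8)).append('\n');
            } else if (delegate != null) {
                text.append(delegate.parse(part.body)).append('\n');
            }
            // With no delegate registered, non-text parts are skipped.
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<Part> parts = List.of(
                new Part("text/plain", "hello".getBytes(StandardCharsets.UTF_8)),
                new Part("text/html", "<b>hi</b>".getBytes(StandardCharsets.UTF_8)));

        // Simple case: no delegate, only the plain text comes out.
        System.out.print(extract(parts, new Context()));

        // Caller opts in to the other parts by registering a delegate.
        Context context = new Context();
        context.set(PartParser.class, body -> "[" + body.length + " bytes handled by delegate]");
        System.out.print(extract(parts, context));
    }
}
```

The point of the split is that the simple case needs no setup, while a caller 
who wants the html or image parts opts in through the context.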

Also, the James jars need to be listed in the Tika bundle pom so they get 
properly included.

> RFC822 messages not parsed
> --------------------------
>
>                 Key: TIKA-461
>                 URL: https://issues.apache.org/jira/browse/TIKA-461
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Joshua Turner
>            Assignee: Julien Nioche
>         Attachments: TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
