[ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-295.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.5
Assignee: Jukka Zitting
Nice work, thanks! I committed the patch (with tabs->spaces changes and an
added license header for the test case) in revision 820967.

For further work on this I would suggest using the Mime4J library [1] from
Apache James, as they've already dealt with many of the questions you raise
above.

I'm resolving this as Fixed as the basic feature is now there thanks to the
patch. Please file additional issues on any future improvements.

[1] http://james.apache.org/mime4j/
> Rough cut of mbox parser
> ------------------------
>
> Key: TIKA-295
> URL: https://issues.apache.org/jira/browse/TIKA-295
> Project: Tika
> Issue Type: New Feature
> Affects Versions: 0.4
> Reporter: Ken Krugler
> Assignee: Jukka Zitting
> Fix For: 0.5
>
> Attachments: tika-295.patch
>
>
> Attached is a patch for a first cut of a parser that handles mailbox (.mbox,
> application/mbox) files.
> * The first email headers are used to fill in metadata. Subsequent email
> headers are tossed.
> * Charset handling needs to be fixed up. It's unclear (not spec'd) whether
> each email uses the charset specified in its own header, or whether the
> entire file should be re-encoded (with the encoding sent in the response
> header or auto-detected).
> * Multi-part emails won't be handled properly, though it's unclear what
> should be done in that case (if anything).
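
As a rough illustration of the behavior described in the first bullet
(metadata taken from the first message's headers, later headers discarded),
here is a minimal Python sketch. It is not the code in tika-295.patch: the
function name parse_mbox is hypothetical, and the sketch assumes that messages
are separated by lines starting with "From " and that the input is already
decoded text. Charset handling and multi-part messages, the open questions
raised above, are ignored entirely.

```python
def parse_mbox(lines):
    """Return (metadata, body_text) for an iterable of mbox text lines.

    Illustrative only: headers of the first message become metadata,
    headers of later messages are dropped, and all bodies are concatenated.
    """
    metadata = {}
    body_lines = []
    message_index = 0    # number of "From " separator lines seen so far
    in_headers = False
    last_header = None   # name of the most recent header, for folded lines

    for raw in lines:
        line = raw.rstrip("\r\n")

        # Assumption: any line starting with "From " begins a new message.
        # Real mbox variants differ on how such lines in bodies are escaped.
        if line.startswith("From "):
            message_index += 1
            in_headers = True
            last_header = None
            continue

        if message_index == 0:
            continue  # ignore anything before the first separator

        if in_headers:
            if line == "":
                in_headers = False  # a blank line ends the header block
            elif message_index == 1:
                if line[0] in " \t" and last_header is not None:
                    # Folded continuation of the previous header value.
                    metadata[last_header] += " " + line.strip()
                elif ":" in line:
                    name, value = line.split(":", 1)
                    last_header = name.strip()
                    metadata[last_header] = value.strip()
            # Headers of the second and later messages are discarded.
            continue

        body_lines.append(line)

    return metadata, "\n".join(body_lines)
```

The function accepts any iterable of lines, for example an open text file or
the result of str.splitlines(). For real mail handling, a dedicated library
such as the Mime4J suggestion above is likely the better starting point.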