[ 
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215838#comment-16215838
 ] 

Ken Krugler commented on TIKA-2478:
-----------------------------------

Hi [~talli...@apache.org] - I've attached two {{multipart/mixed}} examples 
that I used for my own testing.

Regarding how to handle {{multipart/mixed}}, the approach I used was to 
recursively extract text, carrying the current "best type" as part of the 
state. It starts out unset; when the extraction hits an html part, the best 
type is set to html. My preference order was text > html > rtf, since the 
email client had already done a good job of creating plain text, but since 
Tika is trying to return XHTML, I guess it would be html > rtf > text.
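
A minimal sketch of that traversal, to make the "best type as state" idea 
concrete. Everything named here ({{Part}}, {{State}}, {{BodyType}}, 
{{extract}}) is hypothetical and simplified for illustration; it is not the 
Tika or mime4j API, and it assumes the html > rtf > text ordering:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class BestBodySketch {

    /** Body types ordered from least to most preferred (html > rtf > text). */
    enum BodyType { TEXT, RTF, HTML }

    /** Hypothetical MIME part: a leaf with content, or a multipart container. */
    static final class Part {
        final String mimeType;
        final String content;       // null for containers
        final List<Part> children;  // empty for leaves

        Part(String mimeType, String content, List<Part> children) {
            this.mimeType = mimeType;
            this.content = content;
            this.children = children;
        }

        static Part leaf(String mimeType, String content) {
            return new Part(mimeType, content, Collections.<Part>emptyList());
        }

        static Part multipart(Part... children) {
            return new Part("multipart/mixed", null, Arrays.asList(children));
        }
    }

    /** State carried through the recursion; bestType starts out unset. */
    static final class State {
        BodyType bestType = null;
        final List<String> fragments = new ArrayList<>();
    }

    /** Maps a MIME type to a body type, or null if the part is not a body candidate. */
    static BodyType typeOf(String mimeType) {
        switch (mimeType) {
            case "text/html":       return BodyType.HTML;
            case "text/rtf":
            case "application/rtf": return BodyType.RTF;
            case "text/plain":      return BodyType.TEXT;
            default:                return null;
        }
    }

    /** Walks the part tree depth-first, keeping only fragments of the best type seen. */
    static void extract(Part part, State state) {
        if (!part.children.isEmpty()) {
            for (Part child : part.children) {
                extract(child, state);
            }
            return;
        }
        BodyType type = typeOf(part.mimeType);
        if (type == null) {
            return; // e.g. an inline PDF; this sketch simply skips it
        }
        if (state.bestType == null || type.compareTo(state.bestType) > 0) {
            // A better type was found: discard what was collected so far.
            state.bestType = type;
            state.fragments.clear();
            state.fragments.add(part.content);
        } else if (type == state.bestType) {
            // Same type as the current best: keep it as another fragment.
            state.fragments.add(part.content);
        }
        // Lower-ranked types are dropped as redundant.
    }
}
```

Calling {{extract(root, new State())}} on the root part leaves the winning 
type in {{bestType}} and its pieces, in document order, in {{fragments}}. 
Swapping the enum order would give the text > html > rtf preference instead.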

Note that if you have two HTML pieces with something in between (see 
https://issues.apache.org/jira/secure/attachment/12893599/mixed-with-pdf-inline),
 then each HTML is a fragment that you'll need to stitch together.
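
One naive way to do that stitching, as a sketch only: pull out whatever sits 
between {{<body>}} and {{</body>}} in each fragment and concatenate the 
results inside a single wrapper. The class and method names are hypothetical, 
and a regex is a stand-in here; a real implementation would more likely go 
through an HTML parser and SAX events:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlFragmentStitcher {

    // Naive match of the body element's content, case-insensitive and across newlines.
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the content of the body element, or the whole input if none is found. */
    static String bodyOf(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : html.trim();
    }

    /** Concatenates the body content of each fragment into one HTML document. */
    static String stitch(List<String> fragments) {
        StringBuilder sb = new StringBuilder("<html><body>");
        for (String fragment : fragments) {
            sb.append(bodyOf(fragment));
        }
        return sb.append("</body></html>").toString();
    }
}
```

This ignores {{<head>}} content and any per-fragment charset or styling, 
which may or may not matter for the extracted text.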

> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, mixed-simple, 
> mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a.    The mbox file - outer container "/"
> b.    The actual email--  "/embedded-1"
> c.    The utf-8 text content of the email "/embedded-1/embedded-2"
> d.    The utf-8 html content of the email  "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the 
> first non-null email body and then skips the rest.  Please modify MBOX to 
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an 
> example of input sufficient to generate this behavior.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
