[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

Tim Allison (JIRA) Mon, 23 Oct 2017 18:54:59 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216184#comment-16216184
 ]


Tim Allison commented on TIKA-2478:
-----------------------------------

[~kkrugler]'s {{mixed-with-pdf-inline}} has:
{noformat}
* multipart/alternative
  * text
    * multipart/mixed
      * html (7280EA35-27ED-4485-9978-4D9FFCE613A6)
      * pdf
      * html (7280EA35-27ED-4485-9978-4D9FFCE613A6)
{noformat}

{{mixed-simple}} has:
{noformat}
* multipart/mixed
  * multipart/related
    * multipart/alternative
      * text
      * html
  * image/jpeg (inline)
  * image/jpeg (inline)
{noformat}

Our current {{testRFC822-multipart}} has:
{noformat}
* multipart/mixed
  * multipart/alternative
    * text
    * html
  * image/gif
{noformat}
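
For anyone who wants to reproduce outlines like the ones above, here is a minimal sketch using Python's standard {{email}} package rather than Tika. The {{mime_tree}} helper and the {{SAMPLE}} message are hypothetical, written only for illustration; the sample is shaped like {{testRFC822-multipart}} but is not the actual test file.

```python
from email import message_from_string
from email.message import Message


def mime_tree(part: Message, depth: int = 0) -> list[str]:
    """Return one indented bullet per MIME part, depth-first."""
    lines = ["  " * depth + "* " + part.get_content_type()]
    if part.is_multipart():
        # For multipart containers, get_payload() returns the child parts.
        for child in part.get_payload():
            lines.extend(mime_tree(child, depth + 1))
    return lines


# Hypothetical message: multipart/mixed wrapping an alternative body and a GIF.
SAMPLE = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="outer"',
    "",
    "--outer",
    'Content-Type: multipart/alternative; boundary="inner"',
    "",
    "--inner",
    "Content-Type: text/plain",
    "",
    "plain body",
    "--inner",
    "Content-Type: text/html",
    "",
    "<p>html body</p>",
    "--inner--",
    "--outer",
    "Content-Type: image/gif",
    "Content-Transfer-Encoding: base64",
    "",
    "R0lGODlhAQABAAAAACw=",
    "--outer--",
    "",
])

if __name__ == "__main__":
    print("\n".join(mime_tree(message_from_string(SAMPLE))))
```

Running the same helper over the attached files should make it easy to compare their nesting side by side, assuming they parse as plain RFC 822 messages.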




> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, mixed-simple, 
> mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a.    The mbox file - outer container "/"
> b.    The actual email - "/embedded-1"
> c.    The UTF-8 text content of the email - "/embedded-1/embedded-2"
> d.    The UTF-8 HTML content of the email - "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the 
> first non-null email body and then skips the rest.  Please modify the MBOX 
> parser so that it does not emit separate "attached" documents for the HTML 
> body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an 
> example of input sufficient to generate this behavior.
> Thanks!
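
The "one body, skip the rest" behaviour requested above can be sketched as follows. This is not Tika's MSG or MBOX parser code; it uses Python's standard {{email}} package, whose {{get_body}} method picks a single body part from the alternatives. The {{preferred_body}} helper, the sample message, and the plain-before-HTML preference order are all assumptions made for illustration.

```python
from email import message_from_string, policy

# Hypothetical message carrying the same text as plain and as HTML.
RAW = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="alt"',
    "",
    "--alt",
    "Content-Type: text/plain",
    "",
    "plain body",
    "--alt",
    "Content-Type: text/html",
    "",
    "<p>html body</p>",
    "--alt--",
    "",
])


def preferred_body(raw: str) -> tuple[str, str]:
    """Return (content type, text) of the single preferred body part."""
    msg = message_from_string(raw, policy=policy.default)
    # Assumed preference order: plain text first, HTML as the fallback.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ("", "")
    return (body.get_content_type(), body.get_content().strip())


if __name__ == "__main__":
    print(preferred_body(RAW))
```

With this approach only one of the two alternative bodies is surfaced, so the text and HTML versions no longer appear as separate embedded documents.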



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
