Re: [EXTERNAL] Re: Tika parser not parsing email content

Tim Allison Tue, 10 Oct 2023 13:16:36 -0700

I opened: https://issues.apache.org/jira/browse/TIKA-4153


RFC822 detection has been a game of whack-a-mole, especially with malformed
files.  We should continue to refine the detection and fix this issue.
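
For anyone following along, here is a minimal sketch in Java of the two
rules discussed in the quoted message below: the current one (any two field
patterns within the first 1024 characters) and the proposed stricter one (one
field pattern at offset 0 plus a different one within the first 1024
characters). This is not Tika's actual detection code. The class name, the
method names, and the list of header fields are all assumptions made for
illustration, and Tika's real patterns are defined elsewhere and may differ.

```java
import java.util.List;

// Hypothetical illustration only; this is not Tika's detector.
final class Rfc822DetectionSketch {

    // Assumed, illustrative header names; NOT the list Tika actually uses.
    static final List<String> FIELDS =
            List.of("From:", "To:", "Subject:", "Date:", "Sent:");

    // Only the first 1024 characters are inspected, as described in the thread.
    static final int WINDOW = 1024;

    // Current rule as described: any two fields anywhere in the window.
    static boolean looseMatch(String text) {
        String head = head(text);
        int hits = 0;
        for (String field : FIELDS) {
            if (head.contains(field)) {
                hits++;
            }
        }
        return hits >= 2;
    }

    // Proposed rule: one field at offset 0, plus a different field in the window.
    static boolean strictMatch(String text) {
        String head = head(text);
        for (String first : FIELDS) {
            if (!head.startsWith(first)) {
                continue;
            }
            for (String other : FIELDS) {
                if (!other.equals(first) && head.contains(other)) {
                    return true;
                }
            }
        }
        return false;
    }

    private static String head(String text) {
        return text.substring(0, Math.min(text.length(), WINDOW));
    }

    public static void main(String[] args) {
        // Plain-text note that merely quotes an email, as in the report.
        String note = "Some text here 1.\nOriginal Message-----\n"
                + "From: a@example.com\nSent: Thursday\nTo: Some person\n"
                + "Subject: RE: phone call\nSome text here 4.\n";
        // A message whose headers start at offset 0.
        String mail = "From: a@example.com\nTo: b@example.com\n"
                + "Subject: hi\n\nbody\n";

        System.out.println("note: loose=" + looseMatch(note)
                + " strict=" + strictMatch(note));
        System.out.println("mail: loose=" + looseMatch(mail)
                + " strict=" + strictMatch(mail));
    }
}
```

With these two inputs the loose rule flags both texts as email, while the
strict rule flags only the one whose headers begin at offset 0. A real fix
would still have to cope with malformed messages that have leading junk
before the first header.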


On Tue, Oct 10, 2023 at 2:07 PM Josh Burchard <burch...@pnp-hcl.com> wrote:

> Reading this surprised me.  It's too bad the default behavior isn't to
> treat any non-.eml files as plain text and require a configuration setting
> to turn on the detection magic. I personally wouldn't have expected the
> noted behavior and it's likely our company's customers are encountering
> this loss of fidelity when we index their file attachments. Is there a Jira
> item where I can read about the reason behind its current implementation?
>  -Josh/HCL
>
>
>
>
> From:        "Tim Allison" <talli...@apache.org>
> To:        user@tika.apache.org
> Date:        10/10/2023 12:47 PM
> Subject:        [EXTERNAL] Re: Tika parser not parsing email content
> ------------------------------
>
>
>
> I can confirm this is still happening in our main/3.x branch. As you
> probably guessed, the issue is that the file is identified as an email and
> then parsed as if it were one.  If you know that all you have are plain
> text files, you might consider using the TextAndCSVParser or just the
> TXTParser.
>
> One fix for this (and this is for the devs on the list) would be to
> modify our minShouldMatch so that we require at least one of the field
> patterns at offset 0 and then one of the other field patterns at 0:1024. We
> currently require only two of the fields anywhere within the first 1024
> characters.
>
> On Tue, Oct 10, 2023 at 7:12 AM Kashif Khan <kashif.k...@verantos.com>
> wrote:
>
> Hi team,
> I have been using the Tika parser to parse a few text files, and it
> worked fine until I hit an issue: it cannot parse a text file correctly
> if the file contains email/message contents.
> That is, if the text file contains any of the terms 'From: ', 'To: ',
> or 'Sent: ', Tika fails to parse the text correctly.
> In my case, the parser drops most of the text: only a single line
> remains out of 40 lines.
>
> I am sharing a snippet of the text file for an example:
>
> Some text here 1.
> Some text here 2.
> Some text here 3.
> Original Message-----
> From: some_m...@abc.com
> Sent: Thursday, October 31, 2019 9:52 AM
> To: Some person, (The XYZ group)
> Subject: RE: Mr. Random person phone call: MESSAGE
> Hi, I am available now to receive the call.
> Some text here 4.
> Some text here 5.
> Some text here 6.
>
> The Tika parser reduces the above text to a single line:
> Subject: RE: Mr. Random person phone call: MESSAGE
>
> Note that this happens in versions later than Tika 1.19; version 1.19
> parses the contents perfectly fine.
>
> Could you please help me understand the issue or suggest a path forward?
> This would be very helpful.
>
> Thanks in advance.
> -Kashif
>
>
>
