Reading this surprised me.  It's too bad the default behavior isn't to 
treat any non-.eml files as plain text and require a configuration setting 
to turn on the detection magic. I personally wouldn't have expected the 
noted behavior and it's likely our company's customers are encountering 
this loss of fidelity when we index their file attachments. Is there a 
Jira item where I can read about the reason behind its current 
implementation?  -Josh/HCL




From:   "Tim Allison" <talli...@apache.org>
To:     user@tika.apache.org
Date:   10/10/2023 12:47 PM
Subject:        [EXTERNAL] Re: Tika parser not parsing email content



I can confirm this is still happening in our main/3.x branch. As you 
probably guessed, the issue is that the file is identified as an email and 
then parsed as if it were one.  If you know that all you have are plain 
text files, you might consider using the TextAndCSVParser or just the 
TXTParser.

One fix for this (and this is for the devs on the list) would be to 
modify our minShouldMatch so that we require at least one of the field 
patterns at offset 0 and then one of the other field patterns within 
0:1024. We currently require only two of the fields anywhere within the 
first 1024 characters.
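
For anyone following along, here is a rough, stand-alone Java sketch of the 
difference between the two rules. It is not Tika's actual detection code: 
the class name, the field list, and both method names are hypothetical and 
exist only to illustrate the idea of "any two fields in the first 1024 
characters" versus "one field at offset 0 plus a different field in the 
first 1024 characters".

```java
import java.util.List;

// Illustrative sketch only; the names and field list are assumptions, not Tika's real magic.
class HeaderHeuristicSketch {
    // Hypothetical set of header-like field patterns.
    static final List<String> FIELDS =
            List.of("From:", "To:", "Sent:", "Date:", "Subject:", "Message-ID:", "Received:");
    static final int WINDOW = 1024;

    // Only the first WINDOW characters are inspected by either rule.
    static String head(String text) {
        return text.substring(0, Math.min(text.length(), WINDOW));
    }

    // Rule as described for the current behavior: at least two distinct fields anywhere in the window.
    static boolean looseMatch(String text) {
        String head = head(text);
        int hits = 0;
        for (String field : FIELDS) {
            if (head.contains(field)) {
                hits++;
            }
        }
        return hits >= 2;
    }

    // Proposed rule: one field at offset 0, plus a different field later in the window.
    static boolean strictMatch(String text) {
        String head = head(text);
        for (String first : FIELDS) {
            if (!head.startsWith(first)) {
                continue;
            }
            for (String other : FIELDS) {
                if (!other.equals(first) && head.indexOf(other, first.length()) >= 0) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String plainText = "Some text here 1.\nOriginal Message-----\n"
                + "From: someone@example.com\nSent: Thursday, October 31, 2019 9:52 AM\n"
                + "To: Some person\nSubject: RE: phone call\nSome text here 4.\n";
        String email = "From: a@example.com\nTo: b@example.com\nSubject: hi\n\nbody\n";

        System.out.println("plain text: loose=" + looseMatch(plainText)
                + " strict=" + strictMatch(plainText));
        System.out.println("email:      loose=" + looseMatch(email)
                + " strict=" + strictMatch(email));
    }
}
```

Under the looser rule, a text file that merely quotes a message is flagged 
as an email; under the stricter rule it is not, because it does not start 
with a header field, while a file that does start with headers still 
matches.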

On Tue, Oct 10, 2023 at 7:12 AM Kashif Khan <kashif.k...@verantos.com> 
wrote:



Hi team,
I have been using the Tika parser to parse a few text files, and it had 
been working fine until I ran into an issue: it cannot parse a text file 
that contains 'email/message contents'.
That is, if the text file contains any of the terms 'From: ', 'To: ', or 
'Sent: ', Tika fails to parse the text correctly.
In my case, the parser drops most of the lines of the text file, and only 
a single line remains out of 40.

I am sharing a snippet of the text file for an example:
Some text here 1.
Some text here 2.
Some text here 3.
Original Message-----
From: some_m...@abc.com
Sent: Thursday, October 31, 2019 9:52 AM
To: Some person, (The XYZ group)
Subject: RE: Mr. Random person phone call: MESSAGE
Hi,
I am available now to receive the call.
Some text here 4.
Some text here 5.
Some text here 6.

The Tika parser reduces the above text to a single line:
Subject: RE: Mr. Random person phone call: MESSAGE

Note that this happens in versions later than Tika 1.19; version 1.19 
parses the contents perfectly fine.

Could you please help me understand the issue, or suggest a path forward?
That would be very helpful.

Thanks in advance.
-Kashif


