I can confirm this is still happening in our main/3.x branch. As you
probably guessed, the issue is that the file is identified as an email and
then parsed as if it were one. If you know that all you have are plain
text files, you might consider calling the TextAndCSVParser or just the
TXTParser directly instead of relying on auto-detection.

One fix for this (and this is for the devs on the list) would be to modify
our minShouldMatch so that we require at least one of the field patterns at
offset 0 and then one of the other field patterns within 0:1024. We
currently require only two of the fields anywhere within the first 1024
characters.
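
To make the difference between the two rules concrete, here is a rough,
standalone sketch in plain Java. It is not Tika's detection code: the class
name, the method names, and the list of header fields are all made up for
illustration, and the real patterns live in Tika's mime definitions. It only
contrasts "any two fields within the first 1024 characters" with "one field
at offset 0 plus a different field within the first 1024 characters".

```java
public class StrictHeaderCheck {

    // Illustrative header names only (assumption); not Tika's actual pattern list.
    static final String[] FIELDS = {"From:", "To:", "Sent:", "Subject:", "Date:", "Cc:"};

    // Size of the detection window, taken from the 0:1024 range mentioned above.
    static final int WINDOW = 1024;

    // Returns at most the first WINDOW characters of the text.
    static String head(String text) {
        return text.substring(0, Math.min(text.length(), WINDOW));
    }

    // Loose rule: at least two distinct fields anywhere in the window.
    static boolean looseMatch(String text) {
        String head = head(text);
        int hits = 0;
        for (String field : FIELDS) {
            if (head.contains(field)) {
                hits++;
            }
        }
        return hits >= 2;
    }

    // Stricter rule: one field at offset 0, plus a different field later in the window.
    static boolean strictMatch(String text) {
        String head = head(text);
        for (String first : FIELDS) {
            if (!head.startsWith(first)) {
                continue;
            }
            for (String other : FIELDS) {
                if (!other.equals(first) && head.indexOf(other, first.length()) >= 0) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String plainText = "Some text here 1.\n-----Original Message-----\n"
                + "From: someone@example.com\nSent: Thursday, October 31, 2019 9:52 AM\n"
                + "To: Some person\nSubject: RE: phone call\nHi,\n";
        String email = "From: someone@example.com\nTo: other@example.com\n"
                + "Subject: hello\n\nBody text\n";

        System.out.println("plain text, loose rule:  " + looseMatch(plainText));
        System.out.println("plain text, strict rule: " + strictMatch(plainText));
        System.out.println("email, strict rule:      " + strictMatch(email));
    }
}
```

With the stricter rule, a text file that merely quotes an email further down
would no longer look like an email, while a file that starts with a header
field still would.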

On Tue, Oct 10, 2023 at 7:12 AM Kashif Khan <[email protected]>
wrote:

> Hi team,
> I have been using the Tika parser to parse a few text files, and it has
> worked fine until I ran into an issue: it fails to parse a text file
> that contains 'email/message contents'.
> That is, if the text file contains any of the terms 'From: ', 'To: ', or
> 'Sent: ', it is not parsed correctly.
> In my case, the parser drops most of the file's lines: only a single
> line remains out of 40.
>
> I am sharing a snippet of the text file for an example:
>
>>
>> *Some text here 1.*
>> *Some text here 2.*
>> *Some text here 3.*
>> *Original Message-----*
>> *From: [email protected] <[email protected]>*
>> *Sent: Thursday, October 31, 2019 9:52 AM*
>> *To: Some person, (The XYZ group)*
>> *Subject: RE: Mr. Random person phone call: MESSAGE*
>> *Hi,*
>> *I am available now to receive the call.*
>> *Some text here 4.*
>> *Some text here 5.**Some text here 6.*
>
>
> The Tika parser is reducing the above text to only one line as below:
>
>> *Subject: RE: Mr. Random person phone call: MESSAGE*
>
>
> Note that this happens in versions later than Tika 1.19; 1.19 parses the
> contents perfectly fine.
>
> Could you please help me understand the issue or suggest a path forward?
> That would be very helpful.
>
> Thanks in advance.
> -Kashif
>
>
