Hi team,
I have been using the Tika parser to parse a few text files. It worked fine
until I hit an issue: it cannot parse a text file correctly if the file
contains email/message content.
Specifically, if the text file contains terms such as 'From: ', 'To: ', or
'Sent: ', the text is not parsed correctly.
In my case, the parser drops most of the lines of the text file: only a
single line remains out of 40.

Here is a snippet of the text file as an example:

>
> *Some text here 1.*
> *Some text here 2.*
> *Some text here 3.*
> *Original Message-----*
> *From: some_m...@abc.com <some_m...@abc.com>*
> *Sent: Thursday, October 31, 2019 9:52 AM*
> *To: Some person, (The XYZ group)*
> *Subject: RE: Mr. Random person phone call: MESSAGE*
> *Hi,*
> *I am available now to receive the call.*
> *Some text here 4.*
> *Some text here 5.*
> *Some text here 6.*


The Tika parser reduces the above text to a single line:

> *Subject: RE: Mr. Random person phone call: MESSAGE*
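
To make the loss easier to report, here is a minimal sketch of a check that
lists which input lines are missing from the extracted text. It uses only the
Java standard library and does not call Tika itself: the `extracted` string is
assumed to come from whatever Tika call is in use, and the class name
`MissingLines`, the method `missingLines`, and the sample strings in `main`
are hypothetical, made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MissingLines {

    // Returns the non-blank lines of `original` (trimmed) that do not
    // appear as a line in `extracted`. Comparison is exact after trimming.
    public static List<String> missingLines(String original, String extracted) {
        Set<String> kept = new HashSet<>();
        for (String line : extracted.split("\\R")) {
            kept.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : original.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !kept.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample data shaped like the snippet above.
        String original = String.join("\n",
                "Some text here 1.",
                "From: someone@example.com",
                "Subject: RE: phone call",
                "Some text here 4.");
        // Assumption: this stands in for the text returned by the parser.
        String extracted = "Subject: RE: phone call";

        List<String> missing = missingLines(original, extracted);
        System.out.println(missing.size() + " line(s) missing:");
        for (String line : missing) {
            System.out.println("  " + line);
        }
    }
}
```

Given the full text file and the parser output, this would be expected to
list every line except the Subject line.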


Note that this happens only in versions later than Tika 1.19; version 1.19
parses the contents correctly.

Could you please help me understand the issue, or suggest a path forward?
Any pointers would be very helpful.

Thanks in advance.
-Kashif
