I opened: https://issues.apache.org/jira/browse/TIKA-4153
RFC822 detection has been a game of whack-a-mole especially with malformed
files. We should continue to refine the detection/fix this issue.

On Tue, Oct 10, 2023 at 2:07 PM Josh Burchard <burch...@pnp-hcl.com> wrote:

> Reading this surprised me. It's too bad the default behavior isn't to
> treat any non-.eml files as plain text and require a configuration setting
> to turn on the detection magic. I personally wouldn't have expected the
> noted behavior and it's likely our company's customers are encountering
> this loss of fidelity when we index their file attachments. Is there a Jira
> item where I can read about the reason behind its current implementation?
>
> -Josh/HCL
>
> From: "Tim Allison" <talli...@apache.org>
> To: user@tika.apache.org
> Date: 10/10/2023 12:47 PM
> Subject: [EXTERNAL] Re: Tika parser not parsing email content
> ------------------------------
>
> I can confirm this is still happening in our main/3.x branch. As you
> probably guessed, the issue is that the file is identified as an email and
> then parsed as if it were one. If you know that all you have are plain
> text files, you might consider using the TextAndCSVParser or just the
> TXTParser.
>
> One fix for this (and this is for the devs on the list), would be to
> modify our minShouldMatch so that we have at least one of the field
> patterns at offset 0 and then one of the other field patterns at 0:1024. We
> currently require only two of the fields anywhere within the first 1024
> characters.
>
> On Tue, Oct 10, 2023 at 7:12 AM Kashif Khan <kashif.k...@verantos.com>
> wrote:
>
> Hi team,
> I have been working on the Tika parser to parse a few text files and it
> has been working fine until I have come to an issue where it is not able to
> parse the text file if it contains 'email/message contents'.
> This means if the text file contains any of the terms like 'From: ', 'To: ',
> or 'Sent: ', it will fail to parse the text correctly.
> In my case, the parser is deleting the lines of text files and only a
> single line remains out of 40 lines.
>
> I am sharing a snippet of the text file for an example:
>
> Some text here 1.
> Some text here 2.
> Some text here 3.
> Original Message-----
> From: some_m...@abc.com
> Sent: Thursday, October 31, 2019 9:52 AM
> To: Some person, (The XYZ group)
> Subject: RE: Mr. Random person phone call: MESSAGE
> Hi, I am available now to receive the call.
> Some text here 4.
> Some text here 5.
> Some text here 6.
>
> The Tika parser is reducing the above text to only one line as below:
> Subject: RE: Mr. Random person phone call: MESSAGE
>
> Note that this is happening in the version later than Tika 1.19, with 1.19
> is parsing the contents perfectly fine.
>
> Could you please help me to understand the issue or please suggest some
> path forward to this?
> This will be very helpful.
>
> Thanks in advance.
> -Kashif
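
For anyone following the minShouldMatch discussion quoted above, here is a rough sketch of the difference between the two rules: "two field patterns anywhere in the first 1024 characters" versus "one field pattern at offset 0 plus another within 0:1024". It is plain Python for illustration only and is not Tika's code. The field list, the line-start anchoring, and the function names are all assumptions made for the example; Tika's real RFC822 magic is defined in its mime-type configuration and differs in detail.

```python
import re

# Hypothetical header-field patterns, assumed to match at the start of a line.
# Tika's actual RFC822 magic is defined elsewhere and is not reproduced here.
FIELD_PATTERN = re.compile(
    r"^(From|To|Sent|Date|Subject|Received|Message-ID|Cc|Bcc): ?",
    re.IGNORECASE | re.MULTILINE,
)

WINDOW = 1024  # only the first 1024 characters are inspected


def looks_like_email_current(text):
    """Rule as described in the thread: any two fields within the window."""
    return len(FIELD_PATTERN.findall(text[:WINDOW])) >= 2


def looks_like_email_proposed(text):
    """Proposed rule: one field at offset 0, plus one more within the window."""
    head = text[:WINDOW]
    first = FIELD_PATTERN.match(head)  # match() only succeeds at offset 0
    if first is None:
        return False
    # Look for a second field after the first one, still inside the window.
    return FIELD_PATTERN.search(head, first.end()) is not None


# A plain-text note that merely quotes an email, similar to the reported file.
note = (
    "Some text here 1.\n"
    "Some text here 2.\n"
    "Original Message-----\n"
    "From: someone@example.com\n"
    "Sent: Thursday, October 31, 2019 9:52 AM\n"
    "Subject: RE: phone call\n"
    "Hi, I am available now to receive the call.\n"
)

# A file that starts with header fields, as a real message would.
message = "From: someone@example.com\nTo: other@example.com\nSubject: hi\n\nBody\n"

for name, sample in (("note", note), ("message", message)):
    print(name, looks_like_email_current(sample), looks_like_email_proposed(sample))
```

Under these assumptions the quoted note is flagged by the first rule but not by the second, while the file that begins with header fields is flagged by both. The trade-off is that a malformed message with leading junk before its first header would no longer be detected by the stricter rule.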