This is the input file; I think it was not uploaded correctly.

Best regards,
Gerardo
________________________________
From: Gerardo Hernandez
Sent: Thursday, January 18, 2024 10:39 PM
To: user@tika.apache.org <user@tika.apache.org>
Cc: Mikhail Gushinets <mikhail.gushin...@aparavi.com>
Subject: Parser removes file content and treats it as Metadata

Hi,

We are using the Tika parser to extract the contents of files, which we then 
post-process. Unfortunately, we recently got unexpected results from the 
AutoDetectParser with the attached text file, 
SampleFile_M_001.txt<https://aparavi-my.sharepoint.com/:t:/p/g_hernandez/EUPjfGMN1k1Pii3e4h6tzNoBuVrxR7pAsRugZf-Y59Cmjg>.
We expect the result to be the whole text of the file, but the Handler only 
receives:

SUBJECT: XYZ EMPL. OPPORUNITY

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor 
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam... 
(continuing to the end of the file).

The initial text of the file (FROM, TO, DATE, LOCATION) is not included in the 
content; instead, it is registered as metadata:

[Inline image: extracted metadata]

I would like to know whether there is any way to prevent this when using 
AutoDetectParser, so that all of the text is included in the data sent to the 
Handler.

FROM: XYZ EMPL. OPPORUNITY
TO: DFG. OF ABC
DATE: 2020
LOCATION: A.B.C Dist

SUBJECT: XYZ EMPL. OPPORUNITY

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor 
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis 
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu 
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in 
culpa qui officia deserunt mollit anim id est laborum.
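
The behaviour described above is consistent with the file being treated as an RFC 822-style message, where "Name: value" lines before the first blank line are headers and everything after it is the body. Whether that is what happened here is an assumption, not something confirmed in this thread. The sketch below is a simplified, standard-library-only illustration of that split; it is not Tika's detection or parsing code, and the class and method names (HeaderSplitSketch, split) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderSplitSketch {

    /** Holds the header fields and the remaining body text. */
    public static final class Split {
        public final Map<String, String> headers = new LinkedHashMap<>();
        public String body = "";
    }

    /**
     * Splits text the way an RFC 822-style reader would: leading
     * "Name: value" lines become headers, and the first blank line
     * ends the header block. This is an illustration only.
     */
    public static Split split(String text) {
        Split result = new Split();
        String[] lines = text.split("\r?\n", -1);
        int i = 0;
        for (; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                i++; // skip the blank separator line
                break;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                break; // not a header line, so the body starts here
            }
            result.headers.put(line.substring(0, colon).trim(),
                               line.substring(colon + 1).trim());
        }
        // Everything after the header block is kept as body text.
        StringBuilder body = new StringBuilder();
        for (int j = i; j < lines.length; j++) {
            if (j > i) {
                body.append('\n');
            }
            body.append(lines[j]);
        }
        result.body = body.toString();
        return result;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the sample file from this message.
        String sample = String.join("\n",
                "FROM: XYZ EMPL. OPPORUNITY",
                "TO: DFG. OF ABC",
                "DATE: 2020",
                "LOCATION: A.B.C Dist",
                "",
                "SUBJECT: XYZ EMPL. OPPORUNITY",
                "",
                "Lorem ipsum dolor sit amet...");
        Split s = split(sample);
        System.out.println("headers = " + s.headers);
        System.out.println("body    = " + s.body);
    }
}
```

In this sketch the SUBJECT line stays in the body because it follows the first blank line, which matches what the Handler received. If detection as an e-mail message is indeed the cause, one option that may be worth trying is to bypass detection for files known to be plain text and call a text parser such as org.apache.tika.parser.txt.TXTParser directly.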
