Hi Gerardo,

What happens if you set the filename in the metadata, before calling parse()?

E.g.

   metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

I don’t recall whether the Resource Name detector will be called first, before the Mime Magic detector (Tim?). If it is, then having a xxx.txt filename _should_ trigger Tika to use the generic text parser, versus the email parser.
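
A minimal sketch of that experiment, assuming Tika 2.x, where the constant is TikaCoreProperties.RESOURCE_NAME_KEY (in 1.x it was reachable as Metadata.RESOURCE_NAME_KEY). The class name, the helper method and the sample text are made up for illustration; the sample only stands in for SampleFile_M_001.txt. Whether the hint actually changes the detected type is the open question above, so the code prints the result rather than assuming it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ResourceNameHintExample {

    /**
     * Parses the given bytes with AutoDetectParser after setting the filename
     * hint in the metadata. Returns { content type reported by Tika, extracted text }.
     */
    public static String[] parseWithNameHint(byte[] content, String filename) throws Exception {
        Metadata metadata = new Metadata();
        // The hint must be set before parse() so the detector can see it.
        metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, filename);

        // -1 disables the default write limit of BodyContentHandler.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = new ByteArrayInputStream(content)) {
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        }
        return new String[] { metadata.get(Metadata.CONTENT_TYPE), handler.toString() };
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the real input file, not its actual content.
        String sample = "FROM: a@example.com\n"
                + "TO: b@example.com\n"
                + "DATE: 2024-01-18\n"
                + "SUBJECT: XYZ\n"
                + "\n"
                + "Lorem ipsum dolor sit amet.\n";

        String[] result = parseWithNameHint(
                sample.getBytes(StandardCharsets.UTF_8), "SampleFile_M_001.txt");
        System.out.println("Content type: " + result[0]);
        System.out.println("Handler text:\n" + result[1]);
    }
}
```

If the printed content type is still an email type, the filename hint did not win and a different work-around is needed.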

— Ken

On Jan 23, 2024, at 11:26 AM, Gerardo Hernandez <g.hernan...@aparavi.com> wrote:

Sure, I have attached a simplified version of the code we use. Please let me know if there is any way to configure the parser so that the initial lines are also included in the handler contents.

Regards,
Gerardo

From: Ken Krugler <kkrugler_li...@transpac.com>
Sent: Saturday, January 20, 2024 11:54 AM
To: user@tika.apache.org <user@tika.apache.org>
Cc: Mikhail Gushinets <mikhail.gushin...@aparavi.com>
Subject: Re: Parser removes file content and treats it as Metadata
 
I assume you are getting the initial lines as metadata because Tika is identifying the file as email.
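
One way to check that assumption is to ask Tika which media type it detects for the file. This is a rough sketch using the Tika facade class; the class name, method names and sample text are made up for illustration, and the sample only stands in for the real input file. No output is shown because the result depends on the Tika version and on the actual file content.

```java
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class DetectTypeExample {

    /** Detects the media type from the content bytes alone. */
    public static String detectType(byte[] content) {
        return new Tika().detect(content);
    }

    /** Detects the media type from the content bytes plus a filename hint. */
    public static String detectType(byte[] content, String filename) {
        return new Tika().detect(content, filename);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the real input file, not its actual content.
        byte[] sample = ("FROM: a@example.com\n"
                + "TO: b@example.com\n"
                + "DATE: 2024-01-18\n"
                + "SUBJECT: XYZ\n"
                + "\n"
                + "Lorem ipsum dolor sit amet.\n").getBytes(StandardCharsets.UTF_8);

        System.out.println("Content only:      " + detectType(sample));
        System.out.println("With filename hint: " + detectType(sample, "SampleFile_M_001.txt"));
    }
}
```

If the first line prints an email type such as message/rfc822, that would explain why the header-like lines end up in the metadata instead of the handler output.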

If you include details on your code (how you are calling the parser) and version, I’m confident someone can suggest reasonable work-arounds.

Regards,

— Ken


On Jan 18, 2024, at 8:44 PM, Gerardo Hernandez <g.hernan...@aparavi.com> wrote:

This is the input file; I think it was not uploaded correctly.

Best regards,
Gerardo

 
From: Gerardo Hernandez
Sent: Thursday, January 18, 2024 10:39 PM
To: user@tika.apache.org <user@tika.apache.org>
Cc: Mikhail Gushinets <mikhail.gushin...@aparavi.com>
Subject: Parser removes file content and treats it as Metadata
 
Hi, 

We are using the Tika parser to obtain files' contents, which we then post-process. Unfortunately, we recently got some unexpected results from the AutoDetectParser with the attached text file SampleFile_M_001.txt. We expect the whole text of the file as the result, but the Handler receives only:

SUBJECT: XYZ EMPL. OPPORUNITY

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam... (Till the end of the file).

The initial text of the file (FROM, TO, DATE, LOCATION) is not included; it is registered as metadata instead:

<image.png>

I would like to know if there is any way to prevent this from happening with AutoDetectParser, so that all the text is included in the data sent to the Handler.
<SampleFile_M_001.txt>




Attachment: TikaExample.java
Description: Binary data




--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink & Pinot


