I'm wondering if we can tighten the detection to require a newline after the magic number (P2, etc.). It looks like we already require a newline for some of those file format variants. Let me do some research, unless anyone happens to know.
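A minimal sketch of the idea, in standalone Python rather than Tika's actual detector code. The function names, the two-entry magic table, and the "text/plain" fallback are assumptions made for illustration only. It contrasts matching on the two-byte magic alone with also requiring a newline right after it:

```python
# Illustration only: hypothetical helpers, not Tika's API or its mime-type definitions.
# The two magic numbers below are the ones mentioned in this thread.
PNM_MAGIC = {
    b"P1": "image/x-portable-bitmap",
    b"P2": "image/x-portable-graymap",
}

# Assumed fallback type for anything that does not match a magic number.
FALLBACK = "text/plain"


def loose_detect(data: bytes) -> str:
    """Match on the first two bytes only."""
    return PNM_MAGIC.get(data[:2], FALLBACK)


def strict_detect(data: bytes) -> str:
    """Match only if the magic is immediately followed by LF or CR."""
    if data[2:3] in (b"\n", b"\r"):
        return PNM_MAGIC.get(data[:2], FALLBACK)
    return FALLBACK


if __name__ == "__main__":
    for sample in (b"P2P He has Asthma", b"P18-8610 He has Asthma", b"P2\n3 2\n15\n"):
        print(sample, loose_detect(sample), strict_detect(sample))
```

With the stricter check, the two sample strings from the original report fall through to the fallback type, while content that starts with "P2" followed by a newline still matches. Whether the requirement should be a newline specifically or any whitespace is the part that needs checking against the format definitions.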

On Mon, Mar 18, 2024 at 4:40 PM Kashif Khan <[email protected]> wrote:

> Hi,
> I tried configuring the tika configuration using the config file and
> importing it to the program where I am parsing the text, but that didn't
> work and I am still getting the same error/result.
> Basically, I want my program (using tika for parsing) to consider any kind
> of data that is provided as a simple "text" and nothing else.
>
> Could you please suggest a path forward how I can solve this?
>
> -Kashif
>
> On Sun, Mar 17, 2024 at 10:23 PM Tilman Hausherr <[email protected]>
> wrote:
>
>> Hi,
>>
>> The best would of course be that you don't make it look as if your text
>> files are something else.
>>
>> The second best: fine tune the tika configuration
>> https://tika.apache.org/2.9.1/configuring.html
>>
>> Tilman
>>
>> On 17.03.2024 17:46, Kashif Khan wrote:
>>
>> Do you think it is an issue to be fixed? And also, is there a workaround
>> for this to work?
>>
>> On Sun, Mar 17, 2024, 5:03 PM Tilman Hausherr <[email protected]>
>> wrote:
>>
>>> The first one is recognized as image/x-portable-graymap because "P2" is
>>> a magic number for that type.
>>>
>>> "P1" is a magic number for image/x-portable-bitmap.
>>>
>>> Tilman
>>>
>>> On 16.03.2024 12:37, Kashif Khan wrote:
>>>
>>> Hello Tim/Forum,
>>>
>>> While I am trying to parse the below content the result is null/empty:
>>> *"P2P He has Asthma"*
>>> OR
>>> *"P18-8610 He has Asthma"*
>>> OR
>>> *"P2P Scheduled as He had breathing issues *for the last* 1 year."*
>>>
>>> Whereas, the below gets parsed without any issues:
>>> *"He has Asthma"*
>>> *"Appointment Scheduled as He had breathing issues for last 1 year."*
>>>
>>> Could you please help in understand the exact issue and help with the
>>> resolution?
>>>
>>> -Kashif Khan
>>> [email protected]
