The PRONOM file format signature for image/x-portable-bitmap is '“P1” followed by a whitespace char (blank, TAB, CR, LF)', which would at least tighten the detection up a bit. Ditto for image/x-portable-graymap.
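
For illustration, a minimal Python sketch of what that tightened check could look like. This is not Tika's detector code: the function name `sniff_netpbm` and the lookup table are hypothetical, and the only facts it relies on are the ones in this thread ("P1" and "P2" as magic numbers, and the whitespace set quoted from PRONOM).

```python
from typing import Optional

# Magic numbers and MIME types as named in this thread.
MAGIC_TO_MIME = {
    b"P1": "image/x-portable-bitmap",
    b"P2": "image/x-portable-graymap",
}

# Whitespace set quoted from the PRONOM signature: blank, TAB, CR, LF.
WHITESPACE = b" \t\r\n"


def sniff_netpbm(data: bytes) -> Optional[str]:
    """Hypothetical helper: match only if the magic number is followed by whitespace."""
    if len(data) < 3:
        return None
    mime = MAGIC_TO_MIME.get(data[:2])
    # Indexing bytes yields an int, and `int in bytes` is a valid membership test.
    if mime is not None and data[2] in WHITESPACE:
        return mime
    return None


# The strings from the original report no longer match under this check.
print(sniff_netpbm(b"P2P He has Asthma"))
print(sniff_netpbm(b"P18-8610 He has Asthma"))
# A plain-text PBM header still matches.
print(sniff_netpbm(b"P1\n2 2\n0 1\n1 0\n"))
```

Under this sketch, text that happens to start with "P1 " or "P2 " followed by ordinary words would still match, so the check narrows the false positives without eliminating them.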
https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=236&strPageToDisplay=signatures
https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1155&strPageToDisplay=signatures

On Wed, Mar 20, 2024 at 2:14 PM Tim Allison <[email protected]> wrote:

> I'm wondering if we can tighten the detection to include a newline after
> the P2, etc. It looks like we require a new line for some of those file
> format variants. Let me do some research, unless anyone happens to know.
>
> On Mon, Mar 18, 2024 at 4:40 PM Kashif Khan <[email protected]>
> wrote:
>
>> Hi,
>> I tried configuring the tika configuration using the config file and
>> importing it to the program where I am parsing the text, but that didn't
>> work and I am still getting the same error/result.
>> Basically, I want my program (using tika for parsing) to consider any
>> kind of data that is provided as a simple "text" and nothing else.
>>
>> Could you please suggest a path forward how I can solve this?
>>
>> -Kashif
>>
>> On Sun, Mar 17, 2024 at 10:23 PM Tilman Hausherr <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> The best would of course be that you don't make it look as if your text
>>> files are something else.
>>>
>>> The second best: fine tune the tika configuration
>>> https://tika.apache.org/2.9.1/configuring.html
>>>
>>> Tilman
>>>
>>> On 17.03.2024 17:46, Kashif Khan wrote:
>>>
>>> Do you think it is an issue to be fixed? And also, is there a workaround
>>> for this to work?
>>>
>>> On Sun, Mar 17, 2024, 5:03 PM Tilman Hausherr <[email protected]>
>>> wrote:
>>>
>>>> The first one is recognized as image/x-portable-graymap because "P2" is
>>>> a magic number for that type.
>>>>
>>>> "P1" is a magic number for image/x-portable-bitmap.
>>>>
>>>> Tilman
>>>>
>>>> On 16.03.2024 12:37, Kashif Khan wrote:
>>>>
>>>> Hello Tim/Forum,
>>>>
>>>> While I am trying to parse the below content the result is null/empty:
>>>> *"P2P He has Asthma"*
>>>> OR
>>>> *"P18-8610 He has Asthma"*
>>>> OR
>>>> *"P2P Scheduled as He had breathing issues *for the last* 1 year."*
>>>>
>>>> Whereas, the below gets parsed without any issues:
>>>> *"He has Asthma"*
>>>> *"Appointment Scheduled as He had breathing issues for last 1 year."*
>>>>
>>>> Could you please help in understand the exact issue and help with the
>>>> resolution?
>>>>
>>>> -Kashif Khan
>>>> [email protected]
>>>>
>>>>
>>>>
>>>

--
Greg Lepore
Information Technology Specialist
National Archives at College Park
8601 Adelphi Road, Rm 4300
College Park, MD 20740
Cell 443-741-0970 (personal)
