If you know that you're only parsing text files, you could configure only
the TextAndCSVParser and specify that it handles "application/octet-stream".
That should force every file to be processed by that parser. Something like
this:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.csv.TextAndCSVParser">
      <mime>application/octet-stream</mime>
    </parser>
  </parsers>
</properties>
Or you could tell Tika to parse these portable bitmaps and graymaps with the
TextAndCSVParser, something along these lines:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.csv.TextAndCSVParser"/>
    </parser>
    <parser class="org.apache.tika.parser.csv.TextAndCSVParser">
      <mime>image/x-portable-graymap</mime>
      <mime>image/x-portable-bitmap</mime>
      <mime>text/plain</mime>
      <mime>text/csv</mime>
      <mime>text/tsv</mime>
    </parser>
  </parsers>
</properties>
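For context on why your samples get routed away from the text parser in the
first place: as Tilman explains further down, the leading bytes "P1" and "P2"
are treated as magic numbers. Here is a rough sketch of that idea in plain
Java. To be clear, this is a simplified stand-in and not Tika's actual
detector; the class name MagicPrefixSketch, the guessType method, and the
"text/plain" fallback are all my own assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified stand-in for magic-number detection.
// It only looks at the two-byte prefixes mentioned in this thread.
final class MagicPrefixSketch {

    // Returns a media type guess based solely on the first two bytes.
    static String guessType(byte[] data) {
        if (startsWith(data, "P1")) {
            return "image/x-portable-bitmap";
        }
        if (startsWith(data, "P2")) {
            return "image/x-portable-graymap";
        }
        // Assumption for this sketch: anything else counts as plain text.
        return "text/plain";
    }

    // True if data begins with the ASCII bytes of magic.
    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "P2P He has Asthma",
            "P18-8610 He has Asthma",
            "He has Asthma"
        };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
            System.out.println(guessType(bytes) + " <- " + s);
        }
    }
}
```

With a prefix check like this, the first two samples are classified as images
purely because of how they start, which is why mapping those image types to
the TextAndCSVParser (second config above) gets them parsed as text again.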
On Mon, Mar 18, 2024 at 4:40 PM Kashif Khan <[email protected]>
wrote:
> Hi,
> I tried configuring the tika configuration using the config file and
> importing it to the program where I am parsing the text, but that didn't
> work and I am still getting the same error/result.
> Basically, I want my program (using tika for parsing) to consider any kind
> of data that is provided as a simple "text" and nothing else.
>
> Could you please suggest a path forward how I can solve this?
>
> -Kashif
>
> On Sun, Mar 17, 2024 at 10:23 PM Tilman Hausherr <[email protected]>
> wrote:
>
>> Hi,
>>
>> The best would of course be that you don't make it look as if your text
>> files are something else.
>>
>> The second best: fine tune the tika configuration
>> https://tika.apache.org/2.9.1/configuring.html
>>
>> Tilman
>>
>> On 17.03.2024 17:46, Kashif Khan wrote:
>>
>> Do you think it is an issue to be fixed? And also, is there a workaround
>> for this to work?
>>
>> On Sun, Mar 17, 2024, 5:03 PM Tilman Hausherr <[email protected]>
>> wrote:
>>
>>> The first one is recognized as image/x-portable-graymap because "P2" is
>>> a magic number for that type.
>>>
>>> "P1" is a magic number for image/x-portable-bitmap.
>>>
>>> Tilman
>>>
>>> On 16.03.2024 12:37, Kashif Khan wrote:
>>>
>>> Hello Tim/Forum,
>>>
>>> While I am trying to parse the below content the result is null/empty:
>>> *"P2P He has Asthma"*
>>> OR
>>> *"P18-8610 He has Asthma"*
>>> OR
>>> *"P2P Scheduled as He had breathing issues for the last 1 year."*
>>>
>>> Whereas, the below gets parsed without any issues:
>>> *"He has Asthma"*
>>> *"Appointment Scheduled as He had breathing issues for last 1 year."*
>>>
>>> Could you please help me understand the exact issue and help with the
>>> resolution?
>>>
>>> -Kashif Khan
>>> [email protected]
>>>
>>>
>>>
>>