martin k. created TIKA-4172:
-------------------------------

             Summary: Apple binary file incorrectly identified as text/x-sql 
due to filename
                 Key: TIKA-4172
                 URL: https://issues.apache.org/jira/browse/TIKA-4172
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 2.9.1
            Reporter: martin k.


This is related to [https://github.com/eikek/docspell/issues/2376] and 
[https://github.com/eikek/docspell/issues/2403].

Take the following Base64 encoding of a binary Apple-generated file. I do not 
know what the file is for. To recreate it, pipe the Base64 text below to e.g. 
{{base64 -d > something.sql}}:
{code}
ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbUJJTgAA 
AAAAAAAAAAAAAAAAAACCgf+/AAA=
{code}
If this file is named {{something.sql}}, then Tika will classify it as 
{{text/x-sql}}, which it is not. It seems that the filename extension is given 
more weight than the fact that the content is binary.
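
To make the "binary" claim easy to check without Tika, here is a minimal sketch that uses only the JDK. It decodes the payload above and applies a naive NUL-byte test. The class and method names are hypothetical, and the NUL-byte heuristic is only an assumed stand-in for "this is not text"; it is not how Tika detects types.

```java
import java.util.Base64;

// Hypothetical helper, not part of Tika: decodes the sample from this issue
// and applies a naive "contains NUL bytes" heuristic for binary content.
class BinarySampleCheck {

    // Base64 payload copied from the issue, one literal per original line.
    static final String SAMPLE_BASE64 =
        "ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
      + "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbUJJTgAA"
      + "AAAAAAAAAAAAAAAAAACCgf+/AAA=";

    // Decode the sample into raw bytes.
    static byte[] decodeSample() {
        return Base64.getDecoder().decode(SAMPLE_BASE64);
    }

    // Count NUL (0x00) bytes in the data.
    static int countNulBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == 0) {
                count++;
            }
        }
        return count;
    }

    // Assumed heuristic: any NUL byte means the content is not plain text.
    static boolean looksBinary(byte[] data) {
        return countNulBytes(data) > 0;
    }

    public static void main(String[] args) {
        byte[] data = decodeSample();
        System.out.println("decoded bytes: " + data.length);
        System.out.println("NUL bytes:     " + countNulBytes(data));
        System.out.println("looks binary:  " + looksBinary(data));
    }
}
```

A plain SQL script would normally contain no NUL bytes, so a check along these lines separates the sample from genuine {{text/x-sql}} content regardless of the file name.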



--
This message was sent by Atlassian Jira
(v8.20.10#820010)