martin k. created TIKA-4172:
-------------------------------

             Summary: Apple binary file incorrectly identified as text/x-sql due to filename
                 Key: TIKA-4172
                 URL: https://issues.apache.org/jira/browse/TIKA-4172
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 2.9.1
            Reporter: martin k.

This is related to [https://github.com/eikek/docspell/issues/2376] and [https://github.com/eikek/docspell/issues/2403].

Take the following Base64 encoding of a binary Apple-generated file. I have no idea what it does. You can get the file by piping the following to e.g. {{base64 -d > something.sql}}:

{code:java}
ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbUJJTgAA
AAAAAAAAAAAAAAAAAACCgf+/AAA=
{code}

If this file is named {{something.sql}}, then Tika will classify it as {{text/x-sql}}, which it is not. It seems that more weight is given to the filename (extension) than to the fact that the file content is binary.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
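
To make the "file is binary anyway" point concrete, here is a minimal sketch that decodes the payload from the report using only the JDK and applies a crude NUL-byte check. The class and method names ({{BinaryCheck}}, {{looksBinary}}, {{embeddedName}}) are hypothetical, the NUL-byte test is a simple stand-in heuristic and not Tika's detection logic, and the length-prefixed name layout is only read off the decoded bytes, not taken from any format description in the report.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BinaryCheck {
    // Base64 payload copied from the report, with the line breaks removed.
    static final String PAYLOAD =
        "ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
      + "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbUJJTgAA"
      + "AAAAAAAAAAAAAAAAAACCgf+/AAA=";

    // Decode the payload into the raw bytes of the sample file.
    static byte[] decode() {
        return Base64.getDecoder().decode(PAYLOAD);
    }

    // Count NUL bytes; plain-text SQL would normally contain none.
    static int countNulBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == 0) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical heuristic (an assumption, not Tika's algorithm):
    // any NUL byte means the content is treated as binary.
    static boolean looksBinary(byte[] data) {
        return countNulBytes(data) > 0;
    }

    // Assumed layout: byte 1 holds a name length, the name follows at byte 2.
    static String embeddedName(byte[] data) {
        int length = data[1] & 0xFF;
        return new String(data, 2, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = decode();
        System.out.println("decoded bytes: " + data.length);
        System.out.println("NUL bytes:     " + countNulBytes(data));
        System.out.println("looks binary:  " + looksBinary(data));
        System.out.println("embedded name: " + embeddedName(data));
    }
}
```

The sample starts with a NUL byte and is mostly zero padding, so even this naive check flags it as non-text regardless of the {{.sql}} extension.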