[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789647#comment-17789647 ]

Tilman Hausherr commented on TIKA-4172:
---

Your file starts with 00 14 64 30. See also https://www.iana.org/assignments/media-types/application/applefile

No, I don't agree, because: what is a "binary" file after all? There is no fixed definition for this; it's just a file that hasn't been classified.

> Apple binary file incorrectly identified as text/x-sql due to filename
> --
>
> Key: TIKA-4172
> URL: https://issues.apache.org/jira/browse/TIKA-4172
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 2.9.1
> Reporter: martin k.
> Priority: Minor
>
> This is related to [https://github.com/eikek/docspell/issues/2376] and
> [https://github.com/eikek/docspell/issues/2403.]
> Take the following Base64 encoding of a binary Apple-generated file. No idea
> what it does. You can get the file by piping the following to e.g.
> {{base64 -d > something.sql}}
> {code:java}
> ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA
> bUJJTgAA
> AACCgf+/AAA=
> {code}
> If this file is named {{something.sql}}, then Tika will classify it as
> {{text/x-sql}}, which it is not. It seems like more weight is given to
> the filename (extension) than the fact that the file is binary anyway.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
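The leading bytes mentioned in this comment can be checked by decoding the first Base64 line from the issue description. The following is a minimal Python sketch, not part of the original thread; treating the second byte as a length prefix for the name that follows is an assumption made for illustration, and the variable names are hypothetical.

```python
import base64

# First Base64 line from the issue description; the later lines are not needed here.
FIRST_LINE = "ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA"

data = base64.b64decode(FIRST_LINE)

# The four leading bytes, formatted the way they are quoted in the comment.
prefix_hex = data[:4].hex(" ")

# Assumption: byte 1 is a length prefix for an ASCII name starting at byte 2.
name_len = data[1]
embedded_name = data[2:2 + name_len].decode("ascii")

print(prefix_hex)
print(embedded_name)
```

Under that assumption, the decoded name is the same filename that appears in the curl commands elsewhere in this thread.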
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789621#comment-17789621 ]

martin k. commented on TIKA-4172:
-

Right, [~tilman], however, {{application/octet-stream}} is also a valid MIME type. But don't you agree that when a file is classified by content as {{application/octet-stream}}, and I provide the information that the file just so happens to be called {{something.sql}} and has a content-type of {{application/applefile}}, then Tika certainly shouldn't claim it to be {{text/x-sql}}, which it is least of all of them?

Maybe one issue is that the file itself does seem to be corrupt. Libmagic 5.45 reports it as "MacBinary III INVALID date", but Tika doesn't seem to want to treat it as {{application/applefile}} at all:

{{% curl -T ~/.tmp/d0101c66_mySQL40.sql -H "Content-Type: application/applefile" -H "Content-Disposition: attachment; filename=d0101c66_mySQL40.sql" http://localhost:9998/meta/Content-Type}}
{{Failed to get metadata field Content-Type}}

Regardless though, it doesn't seem right to then ignore the fact that the file *is* binary, and return a {{text/*}} content-type instead. Do you agree?
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789542#comment-17789542 ]

Tilman Hausherr commented on TIKA-4172:
---

application/octet-stream is defined as the default by the detection interface if it doesn't know. tika-mimetypes.xml doesn't seem to have any magic that matches your file content.
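The fallback described in this comment can be illustrated with a small Python sketch. It is not Tika's code: the magic table is a hypothetical three-entry stand-in for the magic definitions in tika-mimetypes.xml, and `detect_by_magic` is a made-up name. It only shows how content whose leading bytes match no known signature ends up with the default type.

```python
# Hypothetical magic table: (offset, signature bytes, media type).
# These are well-known signatures, not entries copied from tika-mimetypes.xml.
MAGIC = [
    (0, b"%PDF-", "application/pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"PK\x03\x04", "application/zip"),
]
DEFAULT_TYPE = "application/octet-stream"


def detect_by_magic(head):
    """Return the first matching type, or the default when nothing matches."""
    for offset, signature, mime in MAGIC:
        if head[offset:offset + len(signature)] == signature:
            return mime
    return DEFAULT_TYPE


# The leading bytes quoted in this thread match none of the entries above.
print(detect_by_magic(bytes.fromhex("00146430")))
```

In this sketch, "binary" is simply what remains when no signature matches, which is the sense in which the default is used in the comment above.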
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789440#comment-17789440 ]

martin k. commented on TIKA-4172:
-

Hey [~tilman], I did read over the docs, and I should have stated this. Later in the document (well, I read version 2.9.1), under "The default Mime Types Detector" it also says:

{quote}Firstly, magic based detection is used on the start of the file. … Next, if available, the filename (from TikaCoreProperties.RESOURCE_NAME_KEY) is then used to improve the detail of the detection, such as when magic detects a text file, and the filename hints it's really a CSV. Finally, if available, the supplied content type (from Metadata.CONTENT_TYPE) is used to further refine the type.{quote}

This suggests that the filename is only used to refine the detection. However, in the case I highlighted, the content-based detection would have detected a binary file ({{application/octet-stream}}), which then gets overruled by the filename-based "refinement" into a {{text/*}} content type, and I think this shouldn't be.
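The two readings of "refinement" discussed in this thread can be sketched as a toy model in Python. This is an illustration under stated assumptions, not Tika's implementation: the glob table, the parent links, and the `root_is_refinable` switch are all hypothetical. With the switch on, {{application/octet-stream}} is treated as "not yet classified", so any filename hint counts as a refinement; with it off, a filename hint cannot override an unclassified result.

```python
from fnmatch import fnmatch

# Hypothetical, heavily simplified stand-ins for tika-mimetypes.xml entries.
GLOBS = {"*.sql": "text/x-sql", "*.csv": "text/csv"}
# Hypothetical parent links: child type -> more general type.
PARENTS = {
    "text/x-sql": "text/plain",
    "text/csv": "text/plain",
    "text/plain": "application/octet-stream",
    "application/pdf": "application/octet-stream",
}
UNKNOWN = "application/octet-stream"


def is_specialization(candidate, base):
    """True if `candidate` equals `base` or descends from it via PARENTS."""
    while candidate is not None:
        if candidate == base:
            return True
        candidate = PARENTS.get(candidate)
    return False


def type_from_name(filename):
    """Return the type of the first glob that matches, or None."""
    for pattern, mime in GLOBS.items():
        if fnmatch(filename, pattern):
            return mime
    return None


def detect(magic_type, filename=None, root_is_refinable=True):
    """Start from the magic result, then let the filename refine it."""
    result = magic_type or UNKNOWN
    hint = type_from_name(filename) if filename else None
    if hint is None:
        return result
    if result == UNKNOWN and not root_is_refinable:
        return result
    # Accept the hint only when it narrows the current result.
    return hint if is_specialization(hint, result) else result


print(detect(None, "d0101c66_mySQL40.sql"))
print(detect(None, "d0101c66_mySQL40.sql", root_is_refinable=False))
print(detect("text/plain", "data.csv"))
```

The first call models the behaviour reported in the issue, the second models the behaviour the reporter expects, and the third is the CSV case from the quoted documentation.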
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789318#comment-17789318 ]

Tilman Hausherr commented on TIKA-4172:
---

https://tika.apache.org/2.1.0/detection.html

"Where the name of the file is known, it is sometimes possible to guess the file type from the name or extension. Within the tika-mimetypes.xml file is a list of patterns which are used to identify the type from the filename. However, because files may be renamed, this method of detection is quick but not always as accurate."
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789247#comment-17789247 ]

martin k. commented on TIKA-4172:
-

Thanks [~tilman] for your response. I spent some time with Tika 2.9.1 now, and I think I found the issue:

{{% curl -T ~/.tmp/d0101c66_mySQL40.sql http://localhost:9998/meta/Content-Type}}
{{Content-Type,application/octet-stream}}

but:

{{% curl -T ~/.tmp/d0101c66_mySQL40.sql -H "Content-Disposition: attachment; filename=d0101c66_mySQL40.sql" http://localhost:9998/meta/Content-Type}}
{{Content-Type,text/x-sql; charset=IBM424}}

So it is the filename that is persuading Tika more than the actual contents. Is that intentional?
[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename
[ https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788982#comment-17788982 ]

Tilman Hausherr commented on TIKA-4172:
---

Which Tika call are you using? Have you tried detecting purely on content?