[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789647#comment-17789647
 ] 

Tilman Hausherr commented on TIKA-4172:
---

Your file starts with 00 14 64 30.

See also https://www.iana.org/assignments/media-types/application/applefile
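
Those leading bytes can be checked from the Base64 in the issue description. This is a minimal sketch, assuming a shell with {{base64 -d}}, {{od}}, {{tail}} and {{head}} available. Only the first Base64 line is decoded, and the variable names are illustrative:

```shell
# First Base64 line from the issue description; the remaining lines are not needed here.
first_line='ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA'

# Hex dump of the first four decoded bytes, with spaces and the newline stripped.
head_hex=$(printf '%s' "$first_line" | base64 -d | od -An -tx1 -N4 | tr -d ' \n')
echo "$head_hex"

# Skip the first two bytes and print the next 20 (0x14 = 20).
embedded_name=$(printf '%s' "$first_line" | base64 -d | tail -c +3 | head -c 20)
echo "$embedded_name"
```

The second byte, 0x14, equals the length of the 20-character name that follows it, so the data appears to begin with a length-prefixed file name rather than with text.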

No, I don't agree, because what is a "binary" file after all? There is no fixed 
definition for this; it's just a file that hasn't been classified.

> Apple binary file incorrectly identified as text/x-sql due to filename
> --
>
> Key: TIKA-4172
> URL: https://issues.apache.org/jira/browse/TIKA-4172
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 2.9.1
> Reporter: martin k.
> Priority: Minor
>
> This is related to [https://github.com/eikek/docspell/issues/2376] and 
> [https://github.com/eikek/docspell/issues/2403].
> Take the following Base64 encoding of a binary Apple-generated file. No idea 
> what it does. You can get the file by piping the following to e.g. {{base64 
> -d > something.sql}}
> {code:java}
> ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAA
> bUJJTgAA
> AACCgf+/AAA=
> {code}
> If this file is named {{something.sql}}, then Tika will classify it as 
> {{text/x-sql}}, which it is not. It seems that more weight is given to 
> the filename (extension) than to the fact that the file is binary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-24 Thread martin k. (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789621#comment-17789621
 ] 

martin k. commented on TIKA-4172:
-

Right, [~tilman]; however, {{application/octet-stream}} is also a valid MIME 
type.

But don't you agree that when a file is classified by content as 
{{application/octet-stream}}, and I provide the information that the file 
happens to be called {{something.sql}} and has a content-type of 
{{application/applefile}}, then Tika shouldn't claim it to be 
{{text/x-sql}}, which is the least plausible of the three?

Maybe one issue is that the file itself seems to be corrupt. Libmagic 5.45 
reports it as "MacBinary III INVALID date", but Tika doesn't seem to want to 
treat it as {{application/applefile}} at all:

{{% curl -T ~/.tmp/d0101c66_mySQL40.sql -H "Content-Type: 
application/applefile" -H "Content-Disposition: attachment; 
filename=d0101c66_mySQL40.sql" http://localhost:9998/meta/Content-Type}}
{{Failed to get metadata field Content-Type}}

Regardless, it doesn't seem right to ignore the fact that the file 
*is* binary and return a {{text/*}} content-type instead. Do you agree?



[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-24 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789542#comment-17789542
 ] 

Tilman Hausherr commented on TIKA-4172:
---

application/octet-stream is defined by the detection interface as the default 
when it doesn't know the type. tika-mimetypes.xml doesn't seem to have any magic 
that matches your file content.



[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-24 Thread martin k. (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789440#comment-17789440
 ] 

martin k. commented on TIKA-4172:
-

Hey [~tilman], I did read over the docs, and I should have stated this. Later 
in the document (well, I read version 2.9.1), under "The default Mime Types 
Detector" it also says:

{quote}Firstly, magic based detection is used on the start of the file. … Next, 
if available, the filename (from TikaCoreProperties.RESOURCE_NAME_KEY) is then 
used to improve the detail of the detection, such as when magic detects a text 
file, and the filename hints it's really a CSV. Finally, if available, the 
supplied content type (from Metadata.CONTENT_TYPE) is used to further refine 
the type.{quote}

This suggests that the filename is only used to refine the detection. However, 
in the case I highlighted, content-based detection on its own detects a 
binary file ({{application/octet-stream}}), which then gets overruled by the 
filename-based "refinement" into a {{text/*}} content type. I don't think that 
should happen.
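
The reported outcome is what a "refine only" rule produces if {{application/octet-stream}} is treated as the root of the type hierarchy, so that every type counts as a refinement of it. The following is a toy model of that reading, not Tika's code. The parent table and the function names {{parent_of}}, {{is_specialization}} and {{refine}} are hypothetical, and placing {{text/x-sql}} under {{text/plain}} is an assumption:

```shell
# Hypothetical parent table for a handful of types (illustration only).
parent_of() {
  case "$1" in
    text/x-sql|text/csv) echo text/plain ;;
    text/plain|application/pdf) echo application/octet-stream ;;
    *) echo "" ;;   # the root (or an unknown type) has no parent
  esac
}

# Succeeds if type $1 equals $2 or descends from it in the table above.
is_specialization() {
  t=$1
  while [ -n "$t" ]; do
    [ "$t" = "$2" ] && return 0
    t=$(parent_of "$t")
  done
  return 1
}

# $1: type found by magic, $2: type hinted by the filename.
# The hint is accepted only when it specializes the magic result.
refine() {
  if is_specialization "$2" "$1"; then echo "$2"; else echo "$1"; fi
}

refine text/plain text/csv                    # hint refines text/plain
refine application/pdf text/x-sql             # hint rejected, magic result kept
refine application/octet-stream text/x-sql    # hint accepted: everything descends from the root
```

Under this rule a name hint never overrides a specific magic result, but it always wins when content detection falls back to {{application/octet-stream}}.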



[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789318#comment-17789318
 ] 

Tilman Hausherr commented on TIKA-4172:
---

https://tika.apache.org/2.1.0/detection.html

"Where the name of the file is known, it is sometimes possible to guess the 
file type from the name or extension. Within the tika-mimetypes.xml file is a 
list of patterns which are used to identify the type from the filename.

However, because files may be renamed, this method of detection is quick but 
not always as accurate."



[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-23 Thread martin k. (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789247#comment-17789247
 ] 

martin k. commented on TIKA-4172:
-

Thanks [~tilman] for your response.

I spent some time with Tika 2.9.1 now, and I think I found the issue:

{{% curl -T ~/.tmp/d0101c66_mySQL40.sql 
http://localhost:9998/meta/Content-Type}}
{{Content-Type,application/octet-stream}}

but:

{{% curl -T ~/.tmp/d0101c66_mySQL40.sql -H "Content-Disposition: attachment; 
filename=d0101c66_mySQL40.sql" http://localhost:9998/meta/Content-Type}}
{{Content-Type,text/x-sql; charset=IBM424}}

So the filename persuades Tika more than the actual contents do. Is that 
intentional?



[jira] [Commented] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

2023-11-22 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788982#comment-17788982
 ] 

Tilman Hausherr commented on TIKA-4172:
---

Which Tika call are you using? Have you tried detecting purely on content?
