[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562537#comment-17562537
 ] 

Nick Burch commented on TIKA-3811:
--

You should not be using Apache Tika's detection for anything security-related. 
We do not protect against someone maliciously adding mime magic near the start 
of a file in a way that still allows the underlying file to be processed by 
its intended application. We err on the side of giving a best-guess answer.
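
To illustrate the point about planted magic, here is a minimal sketch of a prefix-based sniffer. It is a stand-in for magic detection in general, not Tika's code; the class name MagicSniffSketch and the single hard-coded PDF signature are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy prefix-based sniffer (hypothetical); not how Tika is implemented. */
final class MagicSniffSketch {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Reports a type purely from the leading bytes of the content. */
    static String sniff(byte[] content) {
        if (content.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(content, PDF_MAGIC.length), PDF_MAGIC)) {
            return "application/pdf";
        }
        return "application/octet-stream";
    }
}
```

Any payload that starts with the bytes `%PDF-` is reported as application/pdf by this sketch, whatever follows those bytes, which is why a best-guess detector is not a security control.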

For the "what is this probably" case, Tika is great. For the "what parser is 
most likely to manage to get text out" case, Tika is great. For "what is this 
for certain even if it is malicious" you need a different tool for your 
detection.

See also 
[https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika]
 for advice on running Tika with untrusted input.

> Exclude NameDetector not working for Tika.detect(file)
> --
>
> Key: TIKA-3811
> URL: https://issues.apache.org/jira/browse/TIKA-3811
> Project: Tika
> Issue Type: Bug
> Components: config, core, detector
> Affects Versions: 2.3.0
> Reporter: Giorgiana Ciobanu
> Priority: Major
> Attachments: invalid_format.vtt, tika-config_test.xml
>
>
> I need to detect the mime type of a file, but for security reasons I want to 
> exclude detection by file name extension. 
> I added a tika-config_test.xml (see attached) to my unit test, but Tika still 
> detects the file by its name extension.
> I attached a test file that is wrongly detected as text/vtt because of its 
> file extension; it should be text/plain in this case.
>  
> The code of my unit test:
> {code:java}
> File file = new File(
>     getClass().getClassLoader().getResource("invalid_format.vtt").getFile());
> TikaConfig tikaConfig = new TikaConfig(
>     getClass().getClassLoader().getResourceAsStream("tika-config_test.xml"));
>
> // returns text/vtt but should be text/plain
> String mimeType = new Tika(tikaConfig).detect(file);
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Giorgiana Ciobanu (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562596#comment-17562596
 ] 

Giorgiana Ciobanu commented on TIKA-3811:
-

[~nick] I understand what you are saying, and I appreciate the clarification 
about the security aspects of Tika's detection.

For what I need, disabling mime type detection by file name/extension is 
enough for now. I suppose it's the NameDetector that needs to be excluded 
from the DefaultDetector in the Tika config, right?

The Tika documentation says it is possible to exclude a detector by 
configuration, which in this case would be org.apache.tika.detect.NameDetector. 
So I expected that, after excluding NameDetector and calling the detect method 
with a File as the input parameter, guessing the mime type from the file 
extension would be skipped.



[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562599#comment-17562599
 ] 

Nick Burch commented on TIKA-3811:
--

Maybe [~tallison] has an idea on the config part; he's been working on that 
area lately...



[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562643#comment-17562643
 ] 

Tim Allison commented on TIKA-3811:
---

The NameDetector isn't ever used in default detection in Tika as far as I can 
tell. MimeTypes is added via hard coding in the DefaultDetector. MimeTypes 
first does byte/mime detection, and then it applies its own name/suffix 
detection. If the type implied by the suffix is a specialization of the type 
detected via bytes, then the suffix-based type is selected.
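
The selection rule described above can be sketched as a toy model. This is not Tika's implementation: the class name DetectionSketch, both lookup tables, and the assumption that text/vtt is registered as a specialization of text/plain are illustrative only.

```java
import java.util.Locale;
import java.util.Map;

/** Toy model of the selection rule (hypothetical); not Tika's actual code. */
final class DetectionSketch {
    // Assumed, hand-written hierarchy: child type -> parent type.
    private static final Map<String, String> PARENT = Map.of(
            "text/vtt", "text/plain",
            "text/plain", "application/octet-stream");

    // Assumed, hand-written suffix table.
    private static final Map<String, String> BY_SUFFIX = Map.of(
            ".vtt", "text/vtt",
            ".txt", "text/plain");

    /** True if candidate equals base or has base among its ancestors. */
    static boolean isSpecialization(String candidate, String base) {
        for (String t = candidate; t != null; t = PARENT.get(t)) {
            if (t.equals(base)) {
                return true;
            }
        }
        return false;
    }

    /**
     * byteType is the result of content-based detection; fileName may be null
     * when the name is hidden from the detector.
     */
    static String detect(String byteType, String fileName) {
        if (fileName == null) {
            return byteType;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return byteType;
        }
        String suffix = fileName.substring(dot).toLowerCase(Locale.ROOT);
        String nameType = BY_SUFFIX.get(suffix);
        // The name-based type wins only when it refines the byte-based type.
        if (nameType != null && isSpecialization(nameType, byteType)) {
            return nameType;
        }
        return byteType;
    }
}
```

With these tables, `detect("text/plain", "invalid_format.vtt")` yields text/vtt, while `detect("text/plain", null)` stays text/plain, which mirrors why hiding the file name changes the result.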

We could refactor the detectors to a) not hard-code MimeTypes and b) not 
include file name detection in MimeTypes, but this would be a major breaking 
change. Maybe in Tika 3.x?

For now, the best you can do is hide the file name from the detector:

{noformat}
Tika tika = new Tika();
Path p = Paths.get("somethingOrOther.vtt");
Metadata metadata = new Metadata();
// No metadata is passed to TikaInputStream.get(), so the file name is not recorded.
try (InputStream tis = TikaInputStream.get(p)) {
    return tika.detect(tis, metadata);
}
{noformat}

If you do this, the file name will be taken into account:
{noformat}
Metadata metadata = new Metadata();
try (InputStream tis = TikaInputStream.get(p, metadata)) {
    return tika.detect(tis, metadata);
}
{noformat}



[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Giorgiana Ciobanu (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562661#comment-17562661
 ] 

Giorgiana Ciobanu commented on TIKA-3811:
-

Thanks [~tallison], I will use detection with an input stream as you recommended.

Yes, for a future version, it would be great to not include the file name 
detection in MimeTypes and instead have it in a separate Detector added to the 
default one, maybe?

Thanks [~nick] for forwarding this to [~tallison].
