[ 
https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562643#comment-17562643
 ] 

Tim Allison commented on TIKA-3811:
-----------------------------------

As far as I can tell, the NameDetector is never used in Tika's default 
detection, so excluding it in the config has no effect here.  MimeTypes is 
hard-coded into the DefaultDetector.  MimeTypes first runs byte-based (magic) 
detection and then applies its own file name/suffix detection.  If the type 
implied by the suffix is a specialization of the type detected from the bytes, 
the suffix-based type is selected.
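
The selection rule described above can be sketched as a small, self-contained 
toy model.  This is not Tika's code: the class name DetectionSketch, the two 
lookup tables, and the crude byte check are all assumptions made for 
illustration, and it uses only the JDK (Java 9+ for Map.of).

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

/** Toy model of "bytes first, then suffix if it specializes". Not Tika's implementation. */
final class DetectionSketch {

    // Hypothetical hierarchy: child type -> the type it specializes.
    private static final Map<String, String> PARENT = Map.of(
            "text/vtt", "text/plain",
            "text/plain", "application/octet-stream");

    // Hypothetical suffix table.
    private static final Map<String, String> BY_SUFFIX = Map.of(
            ".vtt", "text/vtt",
            ".txt", "text/plain",
            ".pdf", "application/pdf");

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Crude content check: PDF magic, else text if no control bytes, else octet-stream. */
    static String detectByBytes(byte[] head) {
        if (head.length == 0) {
            return "application/octet-stream";
        }
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        for (byte b : head) {
            int c = b & 0xFF;
            if (c < 0x09 || (c > 0x0D && c < 0x20)) {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if candidate is a (transitive) child of base in the toy hierarchy. */
    static boolean isSpecializationOf(String candidate, String base) {
        for (String t = PARENT.get(candidate); t != null; t = PARENT.get(t)) {
            if (t.equals(base)) {
                return true;
            }
        }
        return false;
    }

    /** resourceName may be null, which models "the detector never sees the file name". */
    static String detect(byte[] head, String resourceName) {
        String byBytes = detectByBytes(head);
        if (resourceName == null) {
            return byBytes;
        }
        int dot = resourceName.lastIndexOf('.');
        String byName = dot < 0
                ? null
                : BY_SUFFIX.get(resourceName.substring(dot).toLowerCase(Locale.ROOT));
        // The suffix wins only when it refines what the bytes already said.
        if (byName != null && isSpecializationOf(byName, byBytes)) {
            return byName;
        }
        return byBytes;
    }
}
```

In this model, plain text named invalid_format.vtt comes out as text/vtt, 
while the same bytes with a null name come out as text/plain, which mirrors 
the behaviour reported in the issue.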

We could refactor the detectors to a) not hard-code MimeTypes and b) not 
include file name detection in MimeTypes, but this would be a major breaking 
change.  Maybe in Tika 3.x?

For now, the best you can do is hide the file name from the detector:

{noformat}
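        Tika tika = new Tika();
        Path p = Paths.get("somethingOrOther.vtt");
        // Fresh Metadata that is NOT passed to TikaInputStream.get(),
        // so no resource name is recorded for the detector to use.
        Metadata metadata = new Metadata();
        try (InputStream tis = TikaInputStream.get(p)) {
            return tika.detect(tis, metadata);
        }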
{noformat}

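If you instead pass the Metadata to TikaInputStream.get(), the file name is 
recorded in the metadata and will be taken into account:
{noformat}
        Metadata metadata = new Metadata();
        try (InputStream tis = TikaInputStream.get(p, metadata)) {
            return tika.detect(tis, metadata);
        }
{noformat}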

> Exclude NameDetector not working for Tika.detect(file)
> ------------------------------------------------------
>
>                 Key: TIKA-3811
>                 URL: https://issues.apache.org/jira/browse/TIKA-3811
>             Project: Tika
>          Issue Type: Bug
>          Components: config, core, detector
>    Affects Versions: 2.3.0
>            Reporter: Giorgiana Ciobanu
>            Priority: Major
>         Attachments: invalid_format.vtt, tika-config_test.xml
>
>
> I need to detect the mime type of a file, but for security reasons I want to 
> exclude detection by file name extension. 
> I added a tika-config_test.xml (see attached) to my unit test, but it still 
> detects the file by its name extension.
> I attached a test file that is wrongly detected as text/vtt because of the 
> file extension; it should be text/plain in this case.
>  
> The code of my unit test:
> {code:java}
> File file = new File(getClass().getClassLoader().getResource("invalid_format.vtt").getFile());
> TikaConfig tikaConfig = new TikaConfig(this.getClass()
>         .getClassLoader()
>         .getResourceAsStream("tika-config_test.xml"));
>  
> // returns text/vtt but should be text/plain
> String mimeType = new Tika(tikaConfig).detect(file); 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
