[ 
https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417707#comment-17417707
 ] 

Luís Filipe Nassif commented on TIKA-3554:
------------------------------------------

Hi [~tallison], the proposal is to run name hint detection in these cases:
- If the mime type from the name hint has no magic signature associated with it
*and no signature was found in the file*.
- If there was already a signature match *or the file was detected as text*
(to try to specialize it).
- If the mime type from the name hint is defined to always run name detection
(TODO).

I believe it will improve mime detection, but I didn't have time to run 
regression tests to confirm it...
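
To make the three cases above concrete, here is a minimal sketch of the decision as a pure function. It is not Tika code: the class `NameHintPolicy`, the enum `Action` and all parameter names are hypothetical. Reading "try to specialize it" as "accept the name hint only if it is a more specific type than what the content detection found" is an assumption on my part, not something confirmed in this thread.

```java
// Hypothetical sketch of the proposed name-hint policy; not part of Apache Tika.
final class NameHintPolicy {

    enum Action {
        SKIP,            // ignore the file name, keep the content-based result
        APPLY_HINT,      // use the mime type suggested by the file name
        SPECIALIZE_ONLY  // assumption: use the hint only if it refines the detected type
    }

    private NameHintPolicy() {}

    static Action decide(boolean hintTypeHasMagic,
                         boolean magicMatched,
                         boolean detectedAsText,
                         boolean hintAlwaysRuns) {
        // Case 3 (marked TODO in the proposal): the hinted type always runs name detection.
        if (hintAlwaysRuns) {
            return Action.APPLY_HINT;
        }
        // Case 1: the hinted type defines no magic and no signature was found in the file.
        if (!hintTypeHasMagic && !magicMatched) {
            return Action.APPLY_HINT;
        }
        // Case 2: a signature matched or the file looks like text; try to specialize.
        if (magicMatched || detectedAsText) {
            return Action.SPECIALIZE_ONLY;
        }
        // Otherwise the hinted type has magic that did not match: do not trust the name.
        return Action.SKIP;
    }
}
```

Under this reading, the `hello.txt.zip` file from the issue (hinted type has magic, no signature matched, detected as text) gives `SPECIALIZE_ONLY`, and since `application/zip` does not refine `text/plain`, the hint would be dropped.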

> Detect plain text file as application/zip based on file ext wrong
> -----------------------------------------------------------------
>
>                 Key: TIKA-3554
>                 URL: https://issues.apache.org/jira/browse/TIKA-3554
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, metadata, mime
>    Affects Versions: 1.26
>            Reporter: Krisztián Gyula Tóth
>            Priority: Major
>              Labels: mime-type
>         Attachments: image-2021-09-15-10-33-33-560.png
>
>
> *Update:* Tika detect only gets 3400 bytes peeked from the input stream (and 
> the file name), not the entire file's byte array.
> ----
> *Given* a simple plain text file with the file extension `.zip` and with 
> content `Hello World!`. Example file name: "hello.txt.zip"
> *When* calling the function `tika.detect()` with the file bytes from an 
> `InputStream` using `BufferedInputStream`
> {code:java}
> String detectedMimeType = tika.detect(bytes.get(), fileItem.getName());
> {code}
> *Then* it returns `application/zip` as the detected mime type, even though 
> the file's content is plain text (~12 bytes) and only the file extension is 
> `.zip`.
>  
> Note: The result is the same for a file with HTML content that also has 
> `.zip` as its file ext. It's not a super rare file type that's hard to 
> detect, so I'd say it's a bug in Tika.
>  
> *Expected behavior*
> Tika should detect the provided file as a plain text file and return 
> `text/plain` for the detected mime type regardless of the file extension 
> being `.zip`.
>  
> *Suggested solution:*
> Check the file signature in addition to the file extension when the file 
> ext is `.zip`.
> To ensure that the uploaded file is really a zip archive, it should have a 
> file signature matching one of the following:
>  * 50 4B 03 04
>  * 50 4B 05 06 (empty archive)
>  * 50 4B 07 08 (spanned archive)
> See the magic numbers on the Wikipedia page for the ZIP file format: 
> [https://en.wikipedia.org/wiki/ZIP_(file_format)]
>  
> *Background info:*
> We are using `Tika.detect()` to detect the file's mime type when it is 
> uploaded to the server in a Java servlet, before saving it for further 
> processing. This ensures that the client-provided file has the expected mime 
> type, so that only that type of file is accepted. In this context, we are 
> working with `ZIP` archives: users are only allowed to upload zip archives. 
> But it turned out that Tika cannot detect plain text files and still 
> recognizes them as ZIP archives if the file extension is given as `.zip`.
> We are currently using 1.26. There are newer versions of Apache Tika, but 
> this is still an issue in the newer version.
>  
> *How I investigated this:*
>  # A valid zip archive with filename `archive.zip.txt` where the file 
> extension is `.txt`
>  ** Expectation: Tika should detect the file mime type as `application/zip`
>  ** Result: Provides the expected result. A valid zip archive with the 
> `.txt` file extension in its name is still detected as `application/zip`.
>  # A valid zip archive with filename, but without the `.zip` file extension.
>  ** Expectation: Tika should detect the file mime type as `application/zip`
>  ** Result: Provides the expected result. A valid zip archive without the 
> `.zip` file extension in its name is still detected as `application/zip`.
>  # A common GIF file, but with the `.zip` file extension: `something.gif.zip`
>  ** Expectation: Tika should detect the file mime type as `image/gif`
>  ** Result: Provides the expected result. A GIF image with the `.zip` file 
> extension is still detected as `image/gif`.
>  # Any plain text file (can be `HTML` doc or `TEXT`) with filename 
> `myText.zip` where the file extension is `.zip`
>  ** Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content.
>  ** Result: Tika `detect()` **fails**! Detects it as `application/zip`.
>  # Any plain text file (can be `HTML` doc or plain `TEXT`) with filename, but 
> without the file extension.
>  ** Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content.
>  ** Result: Provides the expected result. Detects it as 
> `application/octet-stream`. (This is acceptable for a file with text content 
> and no file extension, although `text/plain` would be a perfect match.)
>  
> |idx|Tika detect file type test case|Pass (Y/N)|Expected|Detected|
> |1.|A valid ZIP archive with `.txt` file ext|Y|application/zip|application/zip|
> |2.|A valid ZIP archive without file ext|Y|application/zip|application/zip|
> |3.|A common binary (GIF) with `.zip` file ext|Y|image/gif|image/gif|
> |4.|A plain text file with `.zip` file ext|N|application/octet-stream (or text/plain)|application/zip|
> |5.|A plain text file without file ext|Y|application/octet-stream (or text/plain)|application/octet-stream|
> *Conclusion*: It turned out that Tika cannot detect plain text files and 
> still recognizes them as ZIP archives if the file extension is given as 
> `.zip`.
> So I think the most significant issue in Tika is telling plain text files 
> apart from ZIP archives. It can detect other known files/binaries even 
> without the file extension.
>  
> *Visual proof*
> See in attachments.
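
The signature check suggested in the quoted issue can be sketched in plain Java as follows. This is an illustration of the idea only, not how Tika implements detection. The class name `ZipSignature` and the method `looksLikeZip` are hypothetical, and the three byte sequences are the ones listed in the issue.

```java
// Hypothetical helper; checks only the leading bytes, as listed in the issue.
final class ZipSignature {

    // 50 4B 03 04, 50 4B 05 06 (empty archive), 50 4B 07 08 (spanned archive)
    private static final byte[][] SIGNATURES = {
        {0x50, 0x4B, 0x03, 0x04},
        {0x50, 0x4B, 0x05, 0x06},
        {0x50, 0x4B, 0x07, 0x08},
    };

    private ZipSignature() {}

    /** Returns true if the given prefix starts with one of the ZIP signatures. */
    static boolean looksLikeZip(byte[] prefix) {
        if (prefix == null) {
            return false;
        }
        for (byte[] signature : SIGNATURES) {
            if (startsWith(prefix, signature)) {
                return true;
            }
        }
        return false;
    }

    private static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An upload handler could call `ZipSignature.looksLikeZip(bytes)` on the peeked bytes before trusting a `.zip` file name. The 12-byte `Hello World!` file from the issue fails this check.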



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
