[
https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416040#comment-17416040
]
Tim Allison commented on TIKA-3554:
-----------------------------------
Shall we close this as not a problem?
> Detect plain text file as application/zip based on file ext wrong
> -----------------------------------------------------------------
>
> Key: TIKA-3554
> URL: https://issues.apache.org/jira/browse/TIKA-3554
> Project: Tika
> Issue Type: Bug
> Components: detector, metadata, mime
> Affects Versions: 1.26
> Reporter: Krisztián Gyula Tóth
> Priority: Major
> Labels: mime-type
> Attachments: image-2021-09-15-10-33-33-560.png
>
>
> *Update:* `Tika.detect()` receives only the first 3400 bytes peeked from the
> input stream (plus the file name), not the entire file's byte array.
> ----
> *Given* a simple plain text file with the file extension `.zip` and with
> content `Hello World!`. Example file name: "hello.txt.zip"
> *When* calling the function `tika.detect()` with the file bytes from an
> `InputStream` using `BufferedInputStream`
> {code:java}
> String detectedMimeType = tika.detect(bytes.get(), fileItem.getName());
> {code}
> *Then* it returns `application/zip` as the detected MIME type, even though
> the file's content is plain text (~12 bytes) and only the file extension
> is `.zip`.
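The "peek a prefix of the stream" setup described in the update can be sketched with the standard library alone. This is only an illustration under assumptions: `StreamPeek` and `peek` are hypothetical names, and the issue does not show how the reporter's servlet actually obtains the bytes it passes to `tika.detect()`.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): reads a prefix without consuming the stream. */
final class StreamPeek {
    private StreamPeek() {}

    /**
     * Reads up to {@code limit} bytes from the current position and then
     * rewinds the stream, so later consumers still see the full content.
     */
    static byte[] peek(BufferedInputStream in, int limit) throws IOException {
        in.mark(limit); // the mark stays valid while at most 'limit' bytes are read
        try {
            byte[] buf = new byte[limit];
            int total = 0;
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break; // end of stream reached before 'limit' bytes
                }
                total += n;
            }
            return Arrays.copyOf(buf, total); // trim to what was actually read
        } finally {
            in.reset(); // rewind to the marked position
        }
    }
}
```

With the 3400-byte limit mentioned in the issue, a call such as `StreamPeek.peek(in, 3400)` on the `Hello World!` file would return all 12 bytes, so in that case the detector sees the complete content.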
>
> Note: The result is the same for a file with HTML content and a `.zip`
> file extension. Plain text is not a rare file type that is hard to
> detect, so I'd say this is a bug in Tika.
>
> *Expected behavior*
> Tika should detect the provided file as a plain text file and return
> `text/plain` for the detected mime type regardless of the file extension
> being `.zip`.
>
> *Suggested solution:*
> Check the file signature in addition to the file extension when the
> extension is `.zip`.
> To ensure that the uploaded file really is a zip archive, it should start
> with one of the following signatures:
> * 50 4B 03 04
> * 50 4B 05 06 (empty archive)
> * 50 4B 07 08 (spanned archive)
> See the magic numbers on the Wikipedia page for the ZIP file format:
> [https://en.wikipedia.org/wiki/ZIP_(file_format)]
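The signature check suggested above could look roughly like the following. It is a minimal sketch using only the three signatures listed in the issue; `ZipSignature` and `looksLikeZip` are hypothetical names and not part of Tika's API, and the check only inspects the first four bytes, so it does not prove that the rest of the file is a well-formed archive.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): checks the ZIP magic numbers listed above. */
final class ZipSignature {
    // Signatures from the list above: regular archive, empty archive,
    // and spanned archive.
    private static final byte[][] SIGNATURES = {
        {0x50, 0x4B, 0x03, 0x04},
        {0x50, 0x4B, 0x05, 0x06},
        {0x50, 0x4B, 0x07, 0x08},
    };

    private ZipSignature() {}

    /** Returns true if the first four bytes match one of the signatures. */
    static boolean looksLikeZip(byte[] head) {
        if (head == null || head.length < 4) {
            return false; // too short to carry a signature
        }
        byte[] firstFour = Arrays.copyOf(head, 4);
        for (byte[] signature : SIGNATURES) {
            if (Arrays.equals(firstFour, signature)) {
                return true;
            }
        }
        return false;
    }
}
```

An upload handler could call `ZipSignature.looksLikeZip(prefix)` on the first bytes of the upload and reject the file when it returns `false`, independently of the file name.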
>
> *Background info:*
> We use `Tika.detect()` in a Java servlet to detect a file's MIME type on
> upload, before saving it for further processing. The goal is to ensure
> that the client-provided file has the expected MIME type and to accept
> only that type. In this context we work with ZIP archives: users are only
> allowed to upload zip archives. However, it turned out that Tika does not
> detect plain text files as such and instead reports them as ZIP archives
> if the file extension is `.zip`.
> We are currently using 1.26. There are newer versions of Apache Tika, but
> this is still an issue in the newer version.
>
> *How I investigated this:*
> # A valid zip archive with filename `archive.zip.txt`, where the file
> extension is `.txt`
> ** Expectation: Tika should detect the file's MIME type as `application/zip`
> ** Result: As expected. A valid zip archive with a `.txt` file extension
> is still detected as `application/zip`.
> # A valid zip archive with a filename that has no `.zip` file extension.
> ** Expectation: Tika should detect the file's MIME type as `application/zip`
> ** Result: As expected. A valid zip archive without the `.zip` file
> extension is still detected as `application/zip`.
> # A common GIF file with a `.zip` file extension (`something.gif.zip`)
> ** Expectation: Tika should detect the file's MIME type as `image/gif`
> ** Result: As expected. A GIF image with a `.zip` file extension is still
> detected as `image/gif`.
> # Any plain text file (`HTML` document or plain `TEXT`) with filename
> `myText.zip`, where the file extension is `.zip`
> ** Expectation: Tika should detect the file's MIME type as
> `application/octet-stream` in general, or as `text/plain` or `text/html`
> depending on the file's content.
> ** Result: Tika `detect()` **fails**! It detects the file as
> `application/zip`.
> # Any plain text file (`HTML` document or plain `TEXT`) with a filename
> that has no file extension.
> ** Expectation: Tika should detect the file's MIME type as
> `application/octet-stream` in general, or as `text/plain` or `text/html`
> depending on the file's content.
> ** Result: As expected. The file is detected as
> `application/octet-stream`. (This is acceptable for a file with text
> content and no file extension, although `text/plain` would be a perfect
> match.)
>
> |idx|Tika detect file type test case|Pass (Y/N)|Expected|Detected|
> |1.|A valid ZIP archive with `.txt` file ext|Y|application/zip|application/zip|
> |2.|A valid ZIP archive without file ext|Y|application/zip|application/zip|
> |3.|A common binary (GIF) with `.zip` file ext|Y|image/gif|image/gif|
> |4.|A plain text file with `.zip` file ext|N|application/octet-stream (or text/plain)|application/zip|
> |5.|A plain text file without file ext|Y|application/octet-stream (or text/plain)|application/octet-stream|
> *Conclusion*: Tika does not detect plain text files as such and instead
> reports them as ZIP archives if the file extension is `.zip`.
> So I think the most significant issue in Tika is distinguishing plain text
> files from ZIP archives. It can detect other known files/binaries even
> without the file extension.
>
> *Visual proof*
> See in attachments.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)