[ 
https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416039#comment-17416039
 ] 

Tim Allison commented on TIKA-3554:
-----------------------------------

I second [~nick]'s recommendation to wrap your stream in a TikaInputStream...it 
is more efficient for spooling to disk when necessary, although it probably 
won't make much of a difference for detection on truncated files.  The real 
benefits come from detection+parsing...I think.

I'd never trust a file name from someone on the web.  So, yes, don't pass the 
filename to Tika.
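
As a rough illustration of judging an upload by its content alone, here is a 
minimal sketch that checks the leading bytes against the three ZIP signatures 
listed in the issue description quoted below.  The class and method names are 
made up for this example, and this is not how Tika itself does detection; it 
only shows that the file name never needs to enter the decision.

```java
import java.nio.charset.StandardCharsets;

class ZipSignatureCheck {
    // The three ZIP signatures listed in the issue description.
    private static final byte[][] ZIP_SIGNATURES = {
        {0x50, 0x4B, 0x03, 0x04}, // regular archive (local file header)
        {0x50, 0x4B, 0x05, 0x06}, // empty archive
        {0x50, 0x4B, 0x07, 0x08}, // spanned archive
    };

    /** Returns true if the prefix starts with one of the ZIP signatures. */
    public static boolean startsWithZipSignature(byte[] prefix) {
        if (prefix == null || prefix.length < 4) {
            return false;
        }
        for (byte[] signature : ZIP_SIGNATURES) {
            boolean match = true;
            for (int i = 0; i < signature.length; i++) {
                if (prefix[i] != signature[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Plain text content, as in the reported "hello.txt.zip" case.
        byte[] text = "Hello World!".getBytes(StandardCharsets.US_ASCII);
        // Hypothetical start of a real archive.
        byte[] zipStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};

        System.out.println(startsWithZipSignature(text));     // false
        System.out.println(startsWithZipSignature(zipStart)); // true
    }
}
```

A check like this only says "looks like a zip container"; it says nothing 
about what is inside the container.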

I doubt this is a concern for you, but beware of differential attacks.  If 
you're truncating the file, Tika will only be able to detect zip, not the 
subtype.  So, someone could send a kmz, docx, etc.  If you have a vulnerable 
parser that somehow handles those file types, the truncated file can be 
detected as zip, get through the scan and then wreak havoc when later detected 
as kmz at parse time.  This does not sound like a threat to you, but it is a 
real issue in some applications.
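
To make the point about subtypes concrete, here is a small sketch using only 
java.util.zip (no Tika involved).  It builds two tiny in-memory archives, one 
with an entry named like an OOXML package and one named like a KMZ, and shows 
that their leading signature bytes are identical while their entry names 
differ.  The class name, helper names and entry contents are placeholders for 
illustration, not real docx or kmz files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipSubtypeDemo {
    /** Builds a small in-memory ZIP containing a single named entry. */
    public static byte[] zipWithEntry(String entryName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("placeholder".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    /** Returns the first four bytes, which is all a signature-only check sees. */
    public static byte[] signature(byte[] data) {
        return Arrays.copyOf(data, 4);
    }

    /** Lists entry names; this requires reading into the container itself. */
    public static List<String> entryNames(byte[] data) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        byte[] docxLike = zipWithEntry("[Content_Types].xml"); // OOXML-style entry name
        byte[] kmzLike = zipWithEntry("doc.kml");              // KMZ-style entry name

        // Same signature bytes, so a signature-only check cannot tell them apart.
        System.out.println(Arrays.equals(signature(docxLike), signature(kmzLike)));
        // The distinguishing information lives in the entries.
        System.out.println(entryNames(docxLike));
        System.out.println(entryNames(kmzLike));
    }
}
```

So if the scan step and the parse step disagree about how much of the file 
they look at, they can also disagree about the type.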

> Detect plain text file as application/zip based on file ext wrong
> -----------------------------------------------------------------
>
>                 Key: TIKA-3554
>                 URL: https://issues.apache.org/jira/browse/TIKA-3554
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, metadata, mime
>    Affects Versions: 1.26
>            Reporter: Krisztián Gyula Tóth
>            Priority: Major
>              Labels: mime-type
>         Attachments: image-2021-09-15-10-33-33-560.png
>
>
> *Update:* Tika's detect only receives the first 3400 bytes peeked from the 
> input stream (and the file name), not the entire file's byte array.
> ----
> *Given* a simple plain text file with the file extension `.zip` and with 
> content `Hello World!`. Example file name: "hello.txt.zip"
> *When* calling the function `tika.detect()` with the file bytes from an 
> `InputStream` using `BufferedInputStream`
> {code:java}
> String detectedMimeType = tika.detect(bytes.get(), fileItem.getName());
> {code}
> *Then* it returns `application/zip` as the detected MIME type, even though 
> the file's content is plain text (~12 bytes) and only the file extension 
> is `.zip`.
>  
> Note: The result is the same for a file with HTML content that also has a 
> `.zip` file ext. It’s not a super rare file type that’s hard to 
> detect. So I’d say it’s a bug in Tika.
>  
> *Expected behavior*
> Tika should detect the provided file as a plain text file and return 
> `text/plain` for the detected mime type regardless of the file extension 
> being `.zip`.
>  
> *Suggested solution:*
> Check the file signature in addition to the file extension when the file 
> ext is `.zip`.
> To ensure that the uploaded file is really a zip archive, it should have a 
> file signature matching one of the following:
>  * 50 4B 03 04
>  * 50 4B 05 06 (empty archive)
>  * 50 4B 07 08 (spanned archive)
> See the magic numbers on the Wiki page for the ZIP file format: 
> [https://en.wikipedia.org/wiki/ZIP_(file_format)|https://en.wikipedia.org/wiki/ZIP_(file_format)]
>  
> *Background info:*
> We are using `Tika.detect()` to detect the file's mime type on upload 
> to the server in a Java servlet, before saving it for further processing, to 
> ensure that the client-provided file has the expected mime type and to 
> accept only that type of file. In this context, we are working with `ZIP` 
> archives. Users are only allowed to upload zip archives. But it turned out 
> that Tika cannot detect plain text files and still recognizes them as ZIP 
> archives if the file extension is given as `.zip`.
> Although there are newer versions of Apache Tika than the one we are 
> currently using (1.26), this is still an issue in the newer version.
>  
> *How I investigated this:*
>  # A valid zip archive with filename `archive.zip.txt` where the file 
> extension is `.txt`
>  ** Expectation: Tika should detect the file mime type as `application/zip`
>  ** Result: Provides the expected result. A valid zip archive with a 
> `.txt` file extension in its name is still detected as 
> `application/zip`.
>  # A valid zip archive with filename, but without the `.zip` file extension.
>  ** Expectation: Tika should detect the file mime type as `application/zip`
>  ** Result: Provides the expected result. A valid zip archive without 
> the `.zip` file extension in its name is still detected as 
> `application/zip`.
>  # A common GIF file, but with a `.zip` file extension: `something.gif.zip`
>  ** Expectation: Tika should detect the file mime type as `image/gif`
>  ** Result: Provides the expected result. A GIF image with a 
> `.zip` file extension is still detected as `image/gif`.
>  # Any plain text file (can be `HTML` doc or `TEXT`) with filename 
> `myText.zip` where the file extension is `.zip`
>  ** Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content.
>  ** Result: Tika `detect()` **fails**! Detects it as `application/zip`.
>  # Any plain text file (can be `HTML` doc or plain `TEXT`) with filename, but 
> without the file extension.
>  ** Expectation: Tika should detect the file mime type as 
> `application/octet-stream` in general or `text/plain` or `text/html` 
> depending on the file's content.
>  ** Result: Provides the expected result. Detects it as 
> `application/octet-stream`. (This is acceptable for a file with text content 
> and no file extension, though `text/plain` would be a perfect match.)
>  
> |idx|Tika detect file type test case|Pass (Y/N)|Expected|Detected|
> |1.|A valid ZIP archive with `.txt` file ext|Y|application/zip|application/zip|
> |2.|A valid ZIP archive without file ext|Y|application/zip|application/zip|
> |3.|A common binary (GIF) with `.zip` file ext|Y|image/gif|image/gif|
> |4.|A plain text file with `.zip` file ext|N|application/octet-stream (or text/plain)|application/zip|
> |5.|A plain text file without file ext|Y|application/octet-stream (or text/plain)|application/octet-stream|
> *Conclusion*: It turned out that Tika cannot detect plain text files and 
> still recognizes them as ZIP archives if the file extension is given as 
> `.zip`.
> So, I think the most significant issue in Tika is telling plain text files 
> apart from ZIP archives. It can detect other known files/binaries even 
> without the file extension.
>  
> *Visual proof*
> See in attachments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
