[
https://issues.apache.org/jira/browse/TIKA-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377542#comment-17377542
]
Tim Allison edited comment on TIKA-3466 at 7/8/21, 6:04 PM:
------------------------------------------------------------
We need to do as much as we can on Tika to get file detection correct.
That said, I worry about letting a browser "execute" untrusted/user-supplied
files without much greater controls in place.
The other issue is that polyglots are a problem in this kind of use case: we
only pick "the best" file type, and we don't currently identify files that can
be both a PDF and a zip file, for example. This tool is still getting off the
ground, but maybe something like this would be better:
https://github.com/trailofbits/polyfile ?
To confirm, you want to allow (and execute) XML in the browser but not XHTML or
html? Are there other file types that you want to exclude (e.g. pdf, jpeg)?
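To make the polyglot concern above concrete, here is a minimal sketch that reports every format whose marker is present instead of a single best match. It is not how Tika or polyfile work; the class `PolyglotSniffer` and its methods are hypothetical names, and the two checks (a `%PDF-` header in the first 1024 bytes, a ZIP end-of-central-directory signature near the end of the file) are simplifying assumptions about what lenient readers accept.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PolyglotSniffer {

    private static final byte[] PDF_HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP end-of-central-directory signature: 'P' 'K' 0x05 0x06
    private static final byte[] ZIP_EOCD = {0x50, 0x4B, 0x05, 0x06};

    /** Returns the first index of needle fully contained in haystack[from, to), or -1. */
    static int indexOf(byte[] haystack, byte[] needle, int from, int to) {
        int last = Math.min(to, haystack.length) - needle.length;
        for (int i = Math.max(from, 0); i <= last; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    /** Lists all candidate types found in the data (assumption: marker presence is enough). */
    static List<String> candidateTypes(byte[] data) {
        List<String> types = new ArrayList<>();
        // Assumption: lenient PDF readers accept the header anywhere in the first 1024 bytes.
        if (indexOf(data, PDF_HEADER, 0, 1024) >= 0) {
            types.add("application/pdf");
        }
        // ZIP readers locate the archive from the end of the file: the end record is
        // at least 22 bytes and may be followed by a comment of up to 65535 bytes.
        int tailStart = data.length - (22 + 65535);
        if (indexOf(data, ZIP_EOCD, tailStart, data.length) >= 0) {
            types.add("application/zip");
        }
        return types;
    }

    public static void main(String[] args) {
        // Build a toy polyglot: a PDF header followed by an empty ZIP end record.
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[pdf.length + 22];
        System.arraycopy(pdf, 0, data, 0, pdf.length);
        System.arraycopy(ZIP_EOCD, 0, data, pdf.length, ZIP_EOCD.length);
        System.out.println(candidateTypes(data));
    }
}
```

A caller that wants to block risky uploads could reject any file for which `candidateTypes` returns more than one entry, rather than trusting a single detected type.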
was (Author: [email protected]):
We need to do as much as we can on Tika to get file detection correct.
That said, I worry about letting a browser "execute" untrusted/user-supplied
files without much great controls in place.
To confirm, you want to allow (and execute) XML in the browser but not XHTML or
html? Are there other file types that you want to exclude (e.g. pdf, jpeg)?
> Cannot detect mimetype of xhtml file when script is first node instead of html
> ------------------------------------------------------------------------------
>
> Key: TIKA-3466
> URL: https://issues.apache.org/jira/browse/TIKA-3466
> Project: Tika
> Issue Type: Bug
> Components: detector, mime
> Affects Versions: 1.27
> Reporter: Packiaraj Sakkanan
> Priority: Major
>
> The MIME type of the XHTML file below is detected as 'application/xml'
> instead of 'application/xhtml+xml':
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" ?>
> <script xmlns="http://www.w3.org/1999/xhtml"><![CDATA[
> alert(555);
> ]]></script>
> {code}
>
> One possible solution is to add a root-XML entry for 'script' to
> tika-mimetypes.xml, like this:
> {code:xml}
> <mime-type type="application/xhtml+xml">
> <!-- The magic priority for xhtml+xml needs to be lower than that of -->
> <!-- files that contain HTML within them, e.g. mime emails -->
> <magic priority="40">
> <match value="&lt;html xmlns=" type="string" offset="0:8192"/>
> </magic>
> <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="html"/>
> <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="script"/>
> <glob pattern="*.xhtml"/>
> <glob pattern="*.xht"/>
> </mime-type>
> {code}
>
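The effect of the proposed extra root-XML entry can be illustrated with a small standalone sketch that maps the root element's namespace and local name to a MIME type. This is not Tika's detector code; it uses only the JDK's StAX parser, and `RootXmlSketch`, `detect`, and the `acceptScriptRoot` flag are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RootXmlSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /**
     * Hypothetical helper: picks a MIME type from the root element only.
     * acceptScriptRoot toggles the extra 'script' rule proposed in the issue.
     */
    static String detect(byte[] xml, boolean acceptScriptRoot) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The input is untrusted, so turn off DTDs and external entities.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                while (reader.hasNext()) {
                    // The first START_ELEMENT event is the document's root element.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String ns = reader.getNamespaceURI();
                        String name = reader.getLocalName();
                        boolean xhtmlRoot = XHTML_NS.equals(ns)
                                && (name.equals("html")
                                    || (acceptScriptRoot && name.equals("script")));
                        return xhtmlRoot ? "application/xhtml+xml" : "application/xml";
                    }
                }
                return "application/xml";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Assumption for this sketch: anything that is not well-formed XML is opaque.
            return "application/octet-stream";
        }
    }

    public static void main(String[] args) {
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
                + "<script xmlns=\"http://www.w3.org/1999/xhtml\"><![CDATA[\n"
                + "alert(555);\n"
                + "]]></script>\n";
        byte[] bytes = sample.getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(bytes, false)); // rule set without the 'script' entry
        System.out.println(detect(bytes, true));  // rule set with the 'script' entry
    }
}
```

With the flag off, the sample from the report falls through to the generic XML type; with it on, the XHTML-namespaced `script` root is classified as XHTML. Note that listing individual element names only covers the names that are listed, so any other root element in the XHTML namespace would still fall through.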