[ 
https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698800#comment-17698800
 ] 

Patrick Schmidt commented on TIKA-3686:
---------------------------------------

Checking for *jQuery* to detect JavaScript seems rather odd, especially 
considering that jQuery's CSS files trigger the same pattern. That's fixing one 
case by breaking another. Often the media type isn't just a nice-to-have; 
things break when it's wrong.

Surely there must be a better suggestion than falling back to the filename 
alone for everything, every time the magic gets it wrong (which you admit it 
most likely will at some point).

It's understandable that in cases with limited information, guesses can be 
wrong. But from a user perspective, classifying a CSS file that even has the 
.css extension as JavaScript, just because it contains the text "jQuery" at 
the start, seems like a bit of a head scratcher. According to the comments in 
tika-mimetypes.xml, the jQuery magic is needed to work around the HTML magic 
(which detects jQuery as HTML). Now we'll have to add more magic on top to keep 
some CSS files from being detected as JavaScript.

This is the mapping we have added for now to work around this issue. Mapping 
the pattern to text/plain triggers Tika to pick a more specific type based on 
the file extension. That way we get to keep the content-based magic for other 
file types.

 
{code:xml}
  <mime-type type="text/plain">
    <magic priority="80">
      <!-- jQuery -->
      <match value="/* jQuery " type="string" offset="0"/>
      <match value="/*! jQuery " type="string" offset="0"/>
      <match value="/*!" type="string" offset="0">
         <match value="* jQuery " offset="4:8"/>
      </match>
    </magic>
  </mime-type>{code}
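
To illustrate why this mapping helps, here is a rough standalone sketch in plain Java. It is not Tika code: the class name, the two lookup tables, and the sample header are made up for the example, and it assumes that the nested match with offset="4:8" accepts any start offset from 4 to 8 inclusive and that a filename-based type wins only when it is a subtype of the magic result.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not part of Tika.
class JQueryMagicSketch {

    // Assumed stand-ins for Tika's glob table and type hierarchy.
    private static final Map<String, String> GLOBS = Map.of(
            ".css", "text/css",
            ".js", "application/javascript");
    private static final Map<String, String> PARENT = Map.of(
            "text/css", "text/plain",
            "application/javascript", "text/plain");

    // True if 'value' occurs in 'data' exactly at 'offset'.
    static boolean startsAt(byte[] data, int offset, String value) {
        byte[] v = value.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || offset + v.length > data.length) {
            return false;
        }
        for (int i = 0; i < v.length; i++) {
            if (data[offset + i] != v[i]) {
                return false;
            }
        }
        return true;
    }

    // True if 'value' starts at any offset in [from, to] (inclusive, assumed).
    static boolean matchesInRange(byte[] data, int from, int to, String value) {
        for (int o = from; o <= to; o++) {
            if (startsAt(data, o, value)) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the three match rules of the mapping above.
    static boolean jqueryMagic(byte[] data) {
        return startsAt(data, 0, "/* jQuery ")
                || startsAt(data, 0, "/*! jQuery ")
                || (startsAt(data, 0, "/*!") && matchesInRange(data, 4, 8, "* jQuery "));
    }

    // Magic first; the extension may then refine the result to a subtype.
    static String detect(byte[] data, String name) {
        String byMagic = jqueryMagic(data) ? "text/plain" : "application/octet-stream";
        String byName = null;
        int dot = (name == null) ? -1 : name.lastIndexOf('.');
        if (dot >= 0) {
            byName = GLOBS.get(name.substring(dot).toLowerCase(Locale.ROOT));
        }
        boolean noMagicHit = byMagic.equals("application/octet-stream");
        if (byName != null && (noMagicHit || byMagic.equals(PARENT.get(byName)))) {
            return byName;
        }
        return byMagic;
    }

    public static void main(String[] args) {
        // Made-up header in the style of the CSS file from the report.
        byte[] css = "/*!\n * jQuery plugin header\n */\n.sw{}"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(css, "smart_wizard_all.min.css")); // prints text/css
    }
}
```

With the pattern mapped to application/javascript instead, the same CSS file would keep the JavaScript type, because text/css is not a subtype of it; that is the case this workaround avoids.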
 

 

> CSS file detected as JavaScript (application/javascript)
> --------------------------------------------------------
>
>                 Key: TIKA-3686
>                 URL: https://issues.apache.org/jira/browse/TIKA-3686
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.0.0-ALPHA
>            Reporter: Marius Dumitru Florea
>            Priority: Major
>
> The following CSS file 
> [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css]
>  is detected as {{application/javascript}} using:
> {noformat}
> TikaUtils.detect(InputStream stream, String name)
> {noformat}
> The reason seems to be that the CSS file starts with:
> {noformat}
> /*!
>  * jQuery
> {noformat}
> which matches the "jQuery" entry from 
> [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348]
>  used by Tika's {{MimeTypes}} detector.
> This is a regression introduced by 
> https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7
>  in TIKA-1141 (2.0.0-ALPHA).
> The implications are serious if the mime type returned by Tika is used to set 
> the content type on the HTTP request returning the CSS file to the browser: 
> the browser ignores the CSS.
> FTR, in my case the CSS file is not served directly from the file system but 
> from a WebJar (in this case 
> https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and 
> we're using Tika to determine the type of files requested from the WebJars.



