[ https://issues.apache.org/jira/browse/TIKA-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486979#comment-16486979 ]
Gerard Bouchar edited comment on TIKA-2648 at 5/23/18 9:41 AM:
---------------------------------------------------------------

So, would you accept a pull request adding a "nohttp" attribute to glob elements in tika-mimetypes.xml, for instance?

This would give something like
{code}
<mime-type type="text/x-php">
  <_comment>PHP script</_comment>
  <magic priority="50">
    <match value="&lt;?php" type="string" offset="0"/>
  </magic>
  <glob nohttp="true" pattern="*.php"/>
  <glob nohttp="true" pattern="*.php3"/>
  <glob nohttp="true" pattern="*.php4"/>
  <sub-class-of type="text/plain"/>
</mime-type>
{code}

And in the code, we would not try to match these patterns if the given resource name starts with "http".


was (Author: gbouchar):
    So, would you accept a pull request adding a "nohttp" attribute to glob elements in tika-mimetypes.xml, for instance?

> mime detection based on resource name detects resources as "text/x-php" instead of "text/html"
> -----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2648
>                 URL: https://issues.apache.org/jira/browse/TIKA-2648
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> When using Tika to detect a MIME type given only a URL containing ".php" and
> a content-type hint of "text/html", it guesses "text/x-php", whereas one
> would expect "text/html".
> {code}
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
>
> TikaConfig tika = new TikaConfig();
> Metadata metadata = new Metadata();
> String url = "https://www.facebook.com/home.php";
> // The only inputs are the resource name and a content-type hint; no stream is given.
> metadata.set(Metadata.RESOURCE_NAME_KEY, url);
> metadata.set(Metadata.CONTENT_TYPE, "text/html");
> MediaType type = tika.getDetector().detect(null, metadata);
> System.out.println(url + " is of type " + type.toString());
> // Prints https://www.facebook.com/home.php is of type text/x-php
> {code}
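
For illustration only, the "nohttp" idea from the comment above could look roughly like the sketch below. It is plain Java with no Tika dependency: NoHttpGlobSketch, GlobRule, RULES and detectByName are hypothetical names, not Tika classes, and the matcher handles only the simple "*.ext" globs used in the example. The lower-casing and the extra "*.html" rule are assumptions added to make the sketch runnable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch of the proposed "nohttp" flag; none of these names exist in Tika.
class NoHttpGlobSketch {

    /** One <glob> entry: a "*.ext" pattern, its MIME type, and the proposed nohttp flag. */
    static final class GlobRule {
        final String pattern;
        final String mimeType;
        final boolean noHttp;

        GlobRule(String pattern, String mimeType, boolean noHttp) {
            this.pattern = pattern;
            this.mimeType = mimeType;
            this.noHttp = noHttp;
        }

        /** Handles only the simple "*.ext" form (assumption: case-insensitive suffix match). */
        boolean matches(String resourceName) {
            String suffix = pattern.substring(1); // "*.php" -> ".php"
            return resourceName.toLowerCase(Locale.ROOT).endsWith(suffix);
        }
    }

    // Mirrors the XML example, plus one ordinary glob without the flag for contrast.
    static final List<GlobRule> RULES = Arrays.asList(
            new GlobRule("*.php", "text/x-php", true),
            new GlobRule("*.php3", "text/x-php", true),
            new GlobRule("*.php4", "text/x-php", true),
            new GlobRule("*.html", "text/html", false));

    /** Returns the MIME type of the first applicable glob, or empty if none applies. */
    static Optional<String> detectByName(String resourceName) {
        // As proposed: rules flagged nohttp are skipped when the name starts with "http".
        boolean looksLikeHttp = resourceName.toLowerCase(Locale.ROOT).startsWith("http");
        for (GlobRule rule : RULES) {
            if (rule.noHttp && looksLikeHttp) {
                continue;
            }
            if (rule.matches(resourceName)) {
                return Optional.of(rule.mimeType);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(detectByName("home.php"));
        System.out.println(detectByName("https://www.facebook.com/home.php"));
        System.out.println(detectByName("https://example.org/index.html"));
    }
}
```

With no glob match for the URL, a caller would be free to fall back to the "text/html" hint. Note that a bare "http" prefix check, as literally proposed, would also skip a local file named something like "http_notes.php"; testing for "http://" or "https://" would be stricter.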