[ https://issues.apache.org/jira/browse/TIKA-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075006#comment-16075006 ]
Luis Filipe Nassif edited comment on TIKA-2419 at 7/5/17 4:03 PM:
------------------------------------------------------------------
Hi Nick,
I solved the original issue of eml/emlx being detected as HTML by increasing
the magic priority of eml/emlx instead of decreasing the HTML priority. Maybe
that is a simpler approach.
was (Author: lfcnassif):
Hi Nick,
I solved the original issue of eml(x) being detected as HTML by increasing
the magic priority of eml(x) instead of decreasing the HTML priority. Maybe
that is a simpler approach.
> Try HTML mime magic on broken XML files
> ---------------------------------------
>
> Key: TIKA-2419
> URL: https://issues.apache.org/jira/browse/TIKA-2419
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.15
> Reporter: Nick Burch
>
> As noticed in the latest Common Crawl work, some URL-hosted HTML files are
> being detected as text/plain and then specialised to a programming
> language type based on their URL extension.
> This is caused by broken XML in the HTML, and by us having dropped the magic
> priority of HTML to 40 (below that of XML) to avoid it matching other
> HTML-containing types such as emails. Because these files have broken XML
> (e.g. an empty encoding on the XML declaration), the XML root extractor
> doesn't run, and they get downmixed to text/plain and then specialised by
> filename.
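
To make the two priority approaches discussed above concrete, here is a
minimal Java sketch. It is not Tika's detector: the Rule class, the
substring-based matching, the marker strings, and every priority except the
HTML value of 40 quoted in the issue are assumptions chosen for illustration.
It also does not model the XML root extractor step; it only shows which magic
rule would win under each priority scheme.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MagicPrioritySketch {

    /** Hypothetical magic rule: a type, a priority, and a marker to search for. */
    static final class Rule {
        final String type;
        final int priority;
        final String marker;

        Rule(String type, int priority, String marker) {
            this.type = type;
            this.priority = priority;
            this.marker = marker;
        }
    }

    /** Returns the type of the highest-priority rule whose marker occurs in head, else text/plain. */
    static String detect(String head, List<Rule> rules) {
        return rules.stream()
                .filter(r -> head.contains(r.marker))
                .max(Comparator.comparingInt((Rule r) -> r.priority))
                .map(r -> r.type)
                .orElse("text/plain");
    }

    /** Builds a rule set; only the HTML and email priorities vary between the two approaches. */
    static List<Rule> rules(int htmlPriority, int emlPriority) {
        return Arrays.asList(
                new Rule("text/html", htmlPriority, "<html"),
                new Rule("application/xml", 45, "<?xml"),            // assumed priority
                new Rule("message/rfc822", emlPriority, "Received:")); // assumed marker
    }

    public static void main(String[] args) {
        String email = "Received: from example.org\n\n<html><body>hi</body></html>";
        String page = "<?xml version=\"1.0\" encoding=\"\"?>\n<html><body>hi</body></html>";

        List<Rule> lowerHtml = rules(40, 50); // HTML dropped to 40, as described in the issue
        List<Rule> raiseEml = rules(50, 60);  // hypothetical: raise the email priority instead

        System.out.println(detect(email, lowerHtml)); // message/rfc822
        System.out.println(detect(page, lowerHtml));  // application/xml
        System.out.println(detect(email, raiseEml));  // message/rfc822
        System.out.println(detect(page, raiseEml));   // text/html
    }
}
```

Under these assumed numbers both schemes keep the HTML-bodied email detected
as an email, but only raising the email priority leaves the HTML rule ahead of
the XML rule for a page that starts with a broken XML declaration.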