Auto-detection of HTML fails with common auto-generated template
----------------------------------------------------------------
Key: TIKA-358
URL: https://issues.apache.org/jira/browse/TIKA-358
Project: Tika
Issue Type: Improvement
Affects Versions: 0.5
Reporter: Ken Krugler
Assignee: Ken Krugler
There's a commonly generated HTML document format that fools the auto-detection
code into classifying it as XML. The XML parser then barfs on the document
because there's a dangling comment at the end.
I've attached one example of this, from http://www.saveums.com/detect.html
In all the cases I've seen, the server returns the right MIME type (text/html),
so perhaps the detection code could use that to disambiguate.
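
As a rough illustration of that disambiguation idea, here is a minimal sketch
in plain Java. It is not Tika's actual detection code: the class
ContentTypeHint, the method resolve, and the choice of which sniffed types
count as "XML" are all assumptions made up for this example. It only shows
the rule "if content sniffing says XML but the server said text/html, trust
the server".

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Tika.
class ContentTypeHint {

    /**
     * Picks between the type found by content sniffing and the type the
     * server reported in its Content-Type header.
     *
     * @param sniffedType  type guessed from the document bytes
     * @param serverHeader raw Content-Type header value, may be null
     */
    static String resolve(String sniffedType, String serverHeader) {
        // No header to consult: keep whatever sniffing produced.
        if (serverHeader == null || serverHeader.trim().isEmpty()) {
            return sniffedType;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String hinted = serverHeader.split(";", 2)[0]
                .trim()
                .toLowerCase(Locale.ROOT);

        // Assumption: these two names cover the "detected as XML" case.
        boolean sniffedXml = "application/xml".equals(sniffedType)
                || "text/xml".equals(sniffedType);

        // Only override in the ambiguous XML-vs-HTML case from this issue.
        if (sniffedXml && "text/html".equals(hinted)) {
            return hinted;
        }
        return sniffedType;
    }

    public static void main(String[] args) {
        System.out.println(resolve("application/xml", "text/html; charset=UTF-8"));
        System.out.println(resolve("application/xml", null));
    }
}
```

The override is deliberately narrow: the server's header wins only when
sniffing says XML and the header says text/html, so a wrong header on some
other kind of document does not change the result.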