[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658028#comment-16658028 ]
Ken Krugler commented on TIKA-2758:
-----------------------------------

[~markus17] - My comment above was about the previous change (from TIKA-2592), which I think was likely a bit too aggressive in treating all non-standard names as ignorable. So I'd be in favor of a revised approach, as per above.

> Possible error charset detection
> --------------------------------
>
>                 Key: TIKA-2758
>                 URL: https://issues.apache.org/jira/browse/TIKA-2758
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.18
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: detroidnews.html, independent.html
>
>
> I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran
> all 995 unit tests and observed three failures: two encoding issues and one
> other weird thing. The tests use real HTML.
> Where we previously extracted text such as 'Spokane, Wash. [— The solar' we
> now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could
> take ["weeks, or' but we now get 'could take [“weeks, or' extracted. Our
> tests pass with 1.17 but fail with 1.18 and 1.19.1.
> Attached are the two HTML files.
> Reading our tests again, I see an old note beside the independent test
> complaining about the character encoding being incorrect. It seems that
> somewhere before 1.17 it was faulty, just as it is now with 1.18 and higher.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
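
For readers unfamiliar with this kind of garbling: the 'â€"' sequence in the report is what typically appears when the UTF-8 bytes of an em dash are decoded with a single-byte Windows code page. The sketch below reproduces that pattern in Python. It is only an illustration under the assumption that the mismatch is UTF-8 read as windows-1252; the issue does not state which charset Tika actually detected, and the sample string and variable names are hypothetical.

```python
# Assumption: the page bytes are UTF-8 but get decoded as windows-1252 (cp1252).
# The issue itself does not say which charset was detected.
original = "Spokane, Wash. \u2014 The solar"  # \u2014 is an em dash

# Bytes as they would be served for a UTF-8 page.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong charset turns the three-byte
# em dash (E2 80 94) into three separate characters.
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible as long as every byte maps to a cp1252 character.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled))
print(repaired == original)
```

If the garbled output matches what a test extracts, that suggests a wrong charset was chosen at decode time rather than that the source HTML itself is damaged.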