[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570587#comment-16570587 ]
Tim Allison commented on TIKA-2673:
-----------------------------------

[~gbouchar],

On the evaluation, it looks like 3 of the files have the same urls: 105,956, but {{segment_big_chrome_charsets.jsonl.xz}} has ~200k... Should I ignore that one?

Second point on the evaluation, I really like how you classified "correct", "similar" and "wrong"...this continues to be an ongoing pain, but it is necessary.

bq. I think most people want an encoding detector that "just works" by default.

Y, I agree. My thinking is that if we migrate to the newer detector, we'd specify it correctly in the SPI file as we do now with html->universal->icu4j. That would then be "just works" by default. Until that point, though, users would have to specify the newer detector, and we can show them that they ought to include icu4j after the newer detector... Let me think about this some more.

bq. I can make a pull request for a separate encoding detector using only the BOM.

I don't feel strongly about this. Let's wait to see if there's a need.

Thank you!

> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.19, 2.0.0
>
>         Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz, image-2018-07-13-11-28-16-657.png
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather 
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue:

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
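
For readers following the thread, the "first detector that answers wins" ordering discussed in the comment (html->universal->icu4j, or a newer detector followed by icu4j) could be sketched roughly as below. This is only an illustration, not Tika code: the SimpleDetector interface, the BOM_ONLY and FALLBACK detectors, and the CHAIN list are hypothetical names made up for this sketch, and the fallback simply returns ISO-8859-1 as a placeholder for a statistical detector such as icu4j. In Tika itself the order comes from the SPI file, as mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class DetectorChainSketch {

    /** Hypothetical stand-in for an encoding detector; not Tika's actual interface. */
    public interface SimpleDetector {
        /** Returns the detected charset, or null if this detector cannot decide. */
        Charset detect(byte[] prefix);
    }

    /** True if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... expected) {
        if (data.length < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if ((data[i] & 0xFF) != expected[i]) {
                return false;
            }
        }
        return true;
    }

    /** Hypothetical BOM-only detector: checks three common BOMs and nothing else. */
    public static final SimpleDetector BOM_ONLY = prefix -> {
        if (startsWith(prefix, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(prefix, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(prefix, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null; // no BOM: let the next detector in the chain try
    };

    /** Placeholder for a statistical detector (assumption: it always returns a guess). */
    public static final SimpleDetector FALLBACK = prefix -> StandardCharsets.ISO_8859_1;

    /** Order matters: the more specific detector runs before the fallback. */
    public static final List<SimpleDetector> CHAIN = Arrays.asList(BOM_ONLY, FALLBACK);

    /** Returns the answer of the first detector that does not return null. */
    public static Charset detect(List<SimpleDetector> chain, byte[] prefix) {
        for (SimpleDetector detector : chain) {
            Charset result = detector.detect(prefix);
            if (result != null) {
                return result;
            }
        }
        return null; // nobody could decide
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(detect(CHAIN, withBom));
        System.out.println(detect(CHAIN, withoutBom));
    }
}
```

Placing the fallback last is what gives the "just works" behaviour: a detector that declines (returns null here) never blocks a later one from answering.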