[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

Tim Allison (JIRA) Mon, 06 Aug 2018 11:20:15 -0700


    [ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570587#comment-16570587 ]


Tim Allison commented on TIKA-2673:
-----------------------------------

[~gbouchar], on the evaluation: it looks like 3 of the files have the same 
URLs (105,956), but {{segment_big_chrome_charsets.jsonl.xz}} has ~200k.  
Should I ignore that one?  Second point on the evaluation: I really like how 
you classified the results as "correct", "similar" and "wrong".  Making that 
distinction is an ongoing pain, but it is necessary.

bq. I think most people want an encoding detector that "just works" by default.
Yes, I agree.  My thinking is that if we migrate to the newer detector, we'd 
register it in the right position in the SPI file, as we do now with 
html->universal->icu4j.  It would then "just work" by default.  Until that 
point, though, users would have to specify the newer detector themselves, and 
we can show them that they ought to include icu4j after it... Let me think 
about this some more.
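
To make the ordering point concrete, here is a minimal sketch of a fallback 
chain in which the first detector that returns an answer wins.  The names 
{{CharsetGuesser}} and {{DetectorChainSketch}} are hypothetical stand-ins 
invented for this example, not Tika classes, and the two lambdas are 
placeholders rather than real detection logic.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for an encoding detector; this is NOT Tika's interface.
interface CharsetGuesser {
    Optional<Charset> guess(byte[] head);
}

final class DetectorChainSketch {
    // Ask each detector in order and return the first answer that is present.
    static Optional<Charset> detect(List<CharsetGuesser> chain, byte[] head) {
        for (CharsetGuesser guesser : chain) {
            Optional<Charset> result = guesser.guess(head);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder: a strict detector that found no declared charset.
        CharsetGuesser strictHtml = head -> Optional.empty();
        // Placeholder: a statistical fallback that always offers a guess.
        CharsetGuesser statistical = head -> Optional.of(StandardCharsets.ISO_8859_1);

        List<CharsetGuesser> chain = Arrays.asList(strictHtml, statistical);
        System.out.println(detect(chain, new byte[0]));
    }
}
```

With only the strict detector in the chain the result would be empty, which 
is why a statistical detector listed after it matters for pages that declare 
nothing.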

bq.  I can make a pull request for a separate encoding detector using only the 
BOM. 
I don't feel strongly about this.  Let's wait to see if there's a need.  Thank 
you!
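
For reference, a BOM-only check could be as small as the sketch below.  This 
is only an illustration: {{BomSniffer}} is a hypothetical name, not an 
existing Tika class and not the code in the attachments, and it assumes that 
only the UTF-8, UTF-16BE and UTF-16LE byte order marks are of interest.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: maps a leading BOM to a charset.
final class BomSniffer {
    // Returns the charset indicated by a leading BOM, or null if none is found.
    static Charset fromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Compares the first bytes of data, as unsigned values, against prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>'};
        System.out.println(fromBom(sample));
    }
}
```

Returning null when no BOM is present would let such a detector sit first in 
a chain and hand everything else to the detectors behind it.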

> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.19, 2.0.0
>
>         Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz, image-2018-07-13-11-28-16-657.png
>
>
> This bug is linked to TIKA-2671; it concerns not the metadata but the 
> bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails at 
> detecting the right charset.
> I am attaching the test cases to this issue: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
