[jira] [Comment Edited] (TIKA-2758) Possible error charset detection

Ken Krugler (JIRA) Sat, 20 Oct 2018 12:52:48 -0700


    [ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657976#comment-16657976 ]


Ken Krugler edited comment on TIKA-2758 at 10/20/18 7:51 PM:
-------------------------------------------------------------

At least for the "detroidnews.html" file, I believe the reason why it's 
detected as ISO-8859-1 instead of UTF-8 is that the UTF-8 sequence (the 
em-dash, U+2014) comes pretty far down in the document.
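
To illustrate why the position of that sequence matters, here is a minimal sketch. It assumes a detector that only inspects a fixed-size prefix of the document; the class name, method names and sample sizes are hypothetical and not taken from Tika. If every byte in the sample is plain ASCII, the sample is equally valid as ISO-8859-1 and as UTF-8, so it gives the detector nothing to choose between them.

```java
final class PrefixSample {
    /** Returns the index of the first byte >= 0x80, or -1 if the data is pure ASCII. */
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0x80) != 0) {
                return i;
            }
        }
        return -1;
    }

    /**
     * True if a detector reading only the first sampleSize bytes (a hypothetical
     * limit) would encounter at least one non-ASCII byte.
     */
    static boolean sampleSeesNonAscii(byte[] data, int sampleSize) {
        int idx = firstNonAscii(data);
        return idx >= 0 && idx < sampleSize;
    }
}
```

Running firstNonAscii over the bytes of an attachment and comparing the result with whatever prefix length the detector actually reads would confirm or rule out this explanation.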

As per my comment on TIKA-2592, I was hoping that this change wouldn't get 
rolled in until we had run against a corpus, as it's the kind of thing that can 
cause unexpected breakage.

As a kinder, gentler approach we could try harder to figure out if we've got an 
invalid but unambiguous charset name in the HTML meta data, and thus map things 
like "utf8" to "UTF-8", versus just calling all of them invalid. I'd guess that 
most browsers do something similar.
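
A rough sketch of what such a lenient lookup could look like, using only java.nio.charset. The CharsetLabels class and its alias table are hypothetical, not existing Tika code, and the table entries are examples rather than a vetted list. The label is first tried as-is (the JDK already accepts some aliases on its own), and only if that fails is it reduced to lowercase letters and digits and looked up in the table.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CharsetLabels {
    // Hypothetical alias table, keyed by the label with everything
    // except [a-z0-9] removed. Entries are illustrative only.
    private static final Map<String, Charset> ALIASES = new HashMap<>();
    static {
        ALIASES.put("utf8", StandardCharsets.UTF_8);
        ALIASES.put("iso88591", StandardCharsets.ISO_8859_1);
        ALIASES.put("latin1", StandardCharsets.ISO_8859_1);
        ALIASES.put("usascii", StandardCharsets.US_ASCII);
    }

    /** Returns the charset for a possibly sloppy label, or null if it stays unknown. */
    static Charset resolve(String label) {
        if (label == null) {
            return null;
        }
        String trimmed = label.trim();
        try {
            if (Charset.isSupported(trimmed)) {
                return Charset.forName(trimmed);
            }
        } catch (IllegalArgumentException e) {
            // Syntactically illegal name (e.g. contains a space); fall through.
        }
        String key = trimmed.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        return ALIASES.get(key);
    }
}
```

With this, resolve("utf 8") and resolve(" UTF_8 ") both return UTF-8, while a label that matches nothing still comes back as null and can be treated as invalid.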


was (Author: kkrugler):
At least for the "detroidnews.html" file, I believe the reason why it's 
detected as ISO-8859-1 instead of UTF-8 is that the UTF-8 sequence (the 
em-dash, U+2014) comes pretty far down in the document.

As per [my 
comment|https://issues.apache.org/jira/browse/TIKA-2592?focusedCommentId=16382330&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16382330]
 on TIKA-2592, I was hoping that this change wouldn't get rolled in until we 
had run against a corpus, as it's the kind of thing that can cause unexpected 
breakage.

As a kinder, gentler approach we could try harder to figure out if we've got an 
invalid but unambiguous charset name in the HTML meta data, and thus map things 
like "utf8" to "UTF-8", versus just calling all of them invalid. I'd guess that 
most browsers do something similar.

> Possible error charset detection
> --------------------------------
>
>                 Key: TIKA-2758
>                 URL: https://issues.apache.org/jira/browse/TIKA-2758
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.18
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: detroidnews.html, independent.html
>
>
> I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran 
> all 995 unit tests and observed three failures, two encoding issues and one 
> other weird thing. The tests use real HTML.
> Where we previously extracted text such as 'Spokane, Wash. [— The solar' we 
> now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could 
> take ["weeks, or' but we now get 'could take [â€œweeks, or' extracted. Our 
> tests pass with 1.17 but fail with 1.18 and 1.19.1.
> Attached are the two HTML files.
> Reading our tests again, I see an old note beside the independent test 
> complaining about the character encoding being incorrect. It seems somewhere 
> before 1.17 it was faulty just as it is now with 1.18 and higher.
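
The garbled strings in the report are what UTF-8 bytes look like when they are decoded with a single-byte Western charset. A small sketch to reproduce this outside Tika follows; the Mojibake class is hypothetical, and windows-1252 as the wrong charset is an assumption based on the characters shown in the report, since a strict ISO-8859-1 decode would yield control characters rather than the euro sign.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Mojibake {
    /** Encodes text as UTF-8, then decodes those bytes with the wrong charset. */
    static String misdecode(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }
}
```

Calling misdecode("\u2014", Charset.forName("windows-1252")) turns the em-dash (bytes E2 80 94) into three characters starting with "â€", which matches the first failing test in the report.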



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
