[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

Tim Allison (JIRA) Mon, 01 Aug 2016 13:52:29 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401985#comment-15401985
 ]


Tim Allison edited comment on TIKA-2038 at 8/1/16 8:51 PM:
-----------------------------------------------------------

This includes the encodings as detected by: 1) Tika default, 2) HTML alone, 3) 
UniversalCharDet alone, 4) ICU4J alone

There are only 77 files in this set for which the HTML detector is not able to 
extract an encoding.  If we assume that the HTML meta header is most often 
correct, and use that as "ground truth" (with caveats!), we see the following 
when comparing it to the other two detectors.

Many of these differences don't matter.  Of concern are those where 
windows-1251 and windows-1256 are misidentified.  From a handful of tests, it 
looks like ICU4J gets the correct encoding for those two encodings when we 
remove the markup in the <style/> and <script/> elements.
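
As a rough illustration of that preprocessing step, here is a minimal Python 
sketch that drops <script> and <style> elements from the raw bytes before they 
are handed to a statistical detector. The function name is hypothetical, the 
regex is only an approximation of real HTML parsing, and it assumes an 
ASCII-compatible encoding (it would not work on UTF-16 input). It is not the 
code used for the tests mentioned above, and the detector call itself is not 
shown.

```python
import re

# Matches a whole <script>...</script> or <style>...</style> element,
# case-insensitively and across line breaks. Working on bytes avoids
# having to guess the encoding before detection has run.
_SCRIPT_OR_STYLE = re.compile(
    rb"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_script_and_style(raw: bytes) -> bytes:
    """Return `raw` with <script> and <style> elements removed.

    Hypothetical helper: assumes the tags themselves are encoded as
    plain ASCII bytes, which holds for the windows-125x, ISO-8859-x,
    UTF-8 and GBK families but not for UTF-16/UTF-32.
    """
    return _SCRIPT_OR_STYLE.sub(b"", raw)
```

The remaining bytes are mostly natural-language text, which is presumably what 
gives a byte-statistics detector a better signal.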

Comparisons of UniversalCharDet to the HTMLDetector:
||HTMLDetector||UniversalEncodingDetector||Count||
|UTF-8|windows-1252|437|
|windows-1256|windows-1252|340|
|GBK|GB18030|320|
|windows-1256|x-MacCyrillic|159|
|GB2312|GB18030|77|
|windows-1256|NULL|34|
|windows-1256|ISO-8859-1|22|
|windows-1256|ISO-8859-5|17|
|windows-1256|KOI8-R|16|
|UTF-8|ISO-8859-1|16|
|Shift_JIS|NULL|5|
|windows-1252|x-MacCyrillic|5|
|windows-1251|x-MacCyrillic|4|
|GBK|windows-1252|3|
|windows-1256|windows-1255|3|
|Shift_JIS|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|UTF-8|2|
|ISO-8859-1|x-MacCyrillic|1|
|UTF-8|windows-1251|1|
|windows-1256|IBM866|1|
|ISO-8859-1|windows-1252|1|
|windows-1256|ISO-8859-7|1|
|windows-1256|ISO-8859-8|1|

Comparisons of ICU4J to the HTMLDetector:
||HTMLDetector||ICU4J||Count||
|UTF-8|ISO-8859-1|465|
|windows-1256|ISO-8859-1|397|
|GBK|GB18030|314|
|windows-1251|ISO-8859-1|232|
|GB2312|GB18030|77|
|windows-1256|windows-1252|10|
|windows-1252|ISO-8859-1|7|
|GBK|ISO-8859-1|7|
|windows-1251|windows-1252|3|
|ISO-8859-1|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|ISO-8859-2|2|
|windows-1256|UTF-16LE|1|
|ISO-8859-1|windows-1256|1|
|windows-1256|Big5|1|
|windows-1256|ISO-8859-9|1|
|GBK|windows-1252|1|
|GBK|EUC-KR|1|
|UTF-8|windows-1252|1|
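
For anyone who wants to reproduce tallies of this shape, here is a minimal 
Python sketch that counts the (reference, candidate) pairs on which two 
detectors disagree and prints them as a JIRA table. The function names and the 
input format (an iterable of label pairs) are assumptions made for 
illustration, not the code that generated the tables above.

```python
from collections import Counter


def tally_disagreements(records):
    """Count (reference, candidate) charset pairs that differ.

    `records` is an iterable of (reference, candidate) charset names.
    Rows without a reference label are skipped, since there is nothing
    to compare against; a missing candidate is reported as "NULL".
    """
    counts = Counter()
    for reference, candidate in records:
        if reference is None:
            continue
        cand = candidate if candidate is not None else "NULL"
        # Charset names are case-insensitive, so compare them that way.
        if reference.lower() != cand.lower():
            counts[(reference, cand)] += 1
    return counts


def to_jira_table(counts, ref_name, cand_name):
    """Render the tally as JIRA wiki markup, largest count first."""
    lines = [f"||{ref_name}||{cand_name}||Count||"]
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (ref, cand), n in ordered:
        lines.append(f"|{ref}|{cand}|{n}|")
    return "\n".join(lines)
```

Calling to_jira_table(tally_disagreements(pairs), "HTMLDetector", "ICU4J") on 
the per-file labels would yield rows in the same layout as above.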




was (Author: talli...@mitre.org):
This includes the encodings as detected by: 1) Tika default, 2) HTML alone, 3) 
UniversalCharDet alone, 4) ICU4J alone

There are only 77 files for which the HTML detector is not able to extract an 
encoding in this set.  If we make the assumption that the html meta-header is 
most often correct, and use that as "ground truth" (with caveats!), we see the 
following when comparing to the other two detectors.

Many of these differences don't matter.  Of concern are those where 
windows-1251 and windows-1256 are misidentified.  From a handful of tests, it 
looks like ICU4J gets the correct encoding for those two encodings when we 
remove the markup.

Comparisons of UniversalCharDet to the HTMLDetector:
||HTMLDetector||UniversalEncodingDetector||Count||
|UTF-8|windows-1252|437|
|windows-1256|windows-1252|340|
|GBK|GB18030|320|
|windows-1256|x-MacCyrillic|159|
|GB2312|GB18030|77|
|windows-1256|NULL|34|
|windows-1256|ISO-8859-1|22|
|windows-1256|ISO-8859-5|17|
|windows-1256|KOI8-R|16|
|UTF-8|ISO-8859-1|16|
|Shift_JIS|NULL|5|
|windows-1252|x-MacCyrillic|5|
|windows-1251|x-MacCyrillic|4|
|GBK|windows-1252|3|
|windows-1256|windows-1255|3|
|Shift_JIS|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|UTF-8|2|
|ISO-8859-1|x-MacCyrillic|1|
|UTF-8|windows-1251|1|
|windows-1256|IBM866|1|
|ISO-8859-1|windows-1252|1|
|windows-1256|ISO-8859-7|1|
|windows-1256|ISO-8859-8|1|

Comparisons of ICU4J to the HTMLDetector:
||HTMLDetector||ICU4J||Count||
|UTF-8|ISO-8859-1|465|
|windows-1256|ISO-8859-1|397|
|GBK|GB18030|314|
|windows-1251|ISO-8859-1|232|
|GB2312|GB18030|77|
|windows-1256|windows-1252|10|
|windows-1252|ISO-8859-1|7|
|GBK|ISO-8859-1|7|
|windows-1251|windows-1252|3|
|ISO-8859-1|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|ISO-8859-2|2|
|windows-1256|UTF-16LE|1|
|ISO-8859-1|windows-1256|1|
|windows-1256|Big5|1|
|windows-1256|ISO-8859-9|1|
|GBK|windows-1252|1|
|GBK|EUC-KR|1|
|UTF-8|windows-1252|1|



> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: iust_encodings.zip, 
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as other naturally text documents. But the accuracy of 
> encoding detection tools, including icu4j, is meaningfully lower for HTML 
> documents than for other text documents. Hence, in our 
> project I developed a library that works pretty well for HTML documents, 
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with 
> HTML documents, it seems that having such a facility in Tika would also 
> help them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
