[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

Tim Allison (JIRA) Fri, 29 Jul 2016 12:57:48 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399911#comment-15399911 ]


Tim Allison edited comment on TIKA-2038 at 7/29/16 7:48 PM:
------------------------------------------------------------

||Subdirectory||Detected by Tika||Count||Percent||
|GBK|GBK|323|77.1%|
|GBK|GB2312|77| |
|GBK|GB18030|13| |
|GBK|UTF-8|3| |
|GBK|windows-1252|3| |
|Shift_JIS|Shift_JIS|639|99.8%|
|Shift_JIS|windows-1252|1| |
|UTF-8|UTF-8|642|97.7%|
|UTF-8|ISO-8859-1|11| |
|UTF-8|windows-1252|4| |
|Windows-1251|windows-1251|313|99.7%|
|Windows-1251|UTF-8|1| |
|Windows-1256|windows-1256|597|92.6%|
|Windows-1256|windows-1252|24| |
|Windows-1256|ISO-8859-1|10| |
|Windows-1256|UTF-8|7| |
|Windows-1256|x-MacCyrillic|5| |
|Windows-1256|IBM866|1| |
|Windows-1256|ISO-8859-5|1| |
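
The Percent column appears to be the number of files detected as the expected charset divided by the total number of files in that subdirectory. A minimal sketch of that arithmetic follows; the class and method names (DetectionAccuracy, percentCorrect, total) are hypothetical and not part of Tika, and the counts are copied from the GBK rows above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: recomputes the Percent column.
final class DetectionAccuracy {

    /** Percentage of files whose detected charset matched the expected one. */
    static double percentCorrect(int correct, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * correct / total;
    }

    /** Sums the counts of all detected charsets for one subdirectory. */
    static int total(Map<String, Integer> detectedCounts) {
        int sum = 0;
        for (int count : detectedCounts.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Detected charset -> file count, copied from the GBK rows of the table.
        Map<String, Integer> gbk = new LinkedHashMap<>();
        gbk.put("GBK", 323);
        gbk.put("GB2312", 77);
        gbk.put("GB18030", 13);
        gbk.put("UTF-8", 3);
        gbk.put("windows-1252", 3);

        double percent = percentCorrect(gbk.get("GBK"), total(gbk));
        System.out.printf(Locale.ROOT, "GBK: %.1f%%%n", percent); // prints "GBK: 77.1%"
    }
}
```

The same calculation gives 639/640 for Shift_JIS and 597/645 for Windows-1256, which agrees with the 99.8% and 92.6% shown in the table.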



was (Author: talli...@mitre.org):
||Subdirectory||Detected by Tika||Count||Percent||
|GBK|GBK|323|77.1%|
|GBK|GB2312|77| |
|GBK|GB18030|13| |
|GBK|UTF-8|3| |
|GBK|windows-1252|3| |
|Shift_JIS|Shift_JIS|639|99.8%|
|Shift_JIS|windows-1252|1| |
|UTF-8|UTF-8|642|97.7%|
|UTF-8|ISO-8859-1|11| |
|UTF-8|windows-1252|4| |
|Windows-1251|windows-1251|313|99.7%|
|Windows-1251|UTF-8|1| |


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents
> as well as of other inherently textual documents. However, the accuracy of
> encoding detection tools, including icu4j, is meaningfully lower for HTML
> documents than for other text documents. Hence, in our project I developed
> a library that works pretty well for HTML documents, which is available
> here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML
> documents, having such a facility in Tika should also help them become
> more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
