[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410111#comment-15410111
 ] 

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

Well, now we can have a fair comparison between Tika and IUST. Note that this 
comparison covers the half of the cases where no charset information is 
available in Meta tags (or where the user does not trust the Meta information 
at all). For the other half, IUST and Tika are on par, though Tika fails in 
some cases, see TIKA-2050. (IUST failed on just two docs.)

In my paper I completely ignored the charsets in Meta tags and didn't include 
them in my computations. But if you want to treat them as trustworthy, you 
should adjust the results accordingly. As they stand, your test results cannot 
be interpreted as the real-world behavior of some of these algorithms, because 
the corpus does not properly represent real-world conditions.

So, for each algorithm that looks for the charset inside the Meta tags (i.e. 1, 
2, 3 in your list above) you should first turn its Meta detection off, then 
compute its accuracy, divide that accuracy (which would be <= 1) by 2, and 
finally add 50% (i.e. 0.5) to the result, as sketched below. If you do that, 
the accuracy of "1. Tika's default detection algorithm" and "3. 
HTMLEncodingDetector" will fall. But I think the accuracy of "2. The proposed 
detection algorithm" won't change (considering GB18030 as an accepted 
detection for GBK).
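To make the adjustment concrete, here is a minimal Java sketch (the method and 
parameter names are mine, just for illustration): it assumes the Meta-declared 
charset is correct for the ~50% of real-world pages that carry one, and that 
the content-based detector handles the rest.

// Estimated real-world accuracy of a detector that trusts Meta tags,
// assuming ~50% of real-world HTML pages declare a correct charset in Meta.
static double adjustedAccuracy(double metaOffAccuracy) {
    if (metaOffAccuracy < 0.0 || metaOffAccuracy > 1.0) {
        throw new IllegalArgumentException("accuracy must be in [0, 1]");
    }
    // half of the pages are resolved by the Meta tag,
    // the other half fall back to content-based detection
    return 0.5 + metaOffAccuracy / 2.0;
}

For example, a detector that is 90% accurate with Meta detection turned off 
would be estimated at 0.5 + 0.90 / 2 = 0.95 in the real world.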

The real-world behavior of the last four algorithms, i.e. 4, 5, 6, 7, would be 
just like the results you've attached, because they don't look for the charset 
in Meta tags even if it exists there.

p.s. In the early stages of my work I tested existing tools against just two 
encodings, UTF-8 and Windows-1256. Since JUniversalCharDet failed completely 
on Windows-1256 and was not perfect on UTF-8, I assumed it was a poor release 
of JCharDet and discarded it at the very beginning of my work... and later on 
I didn't test it with other encodings. But now it looks great for 
Windows-1251, GBK and Shift_JIS. Nevertheless, its UTF-8 detection is weaker 
than what I've seen before.

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as other natural-text documents. But the accuracy of 
> encoding detector tools, including icu4j, is meaningfully lower for HTML 
> documents than for other text documents. Hence, in our project I developed a 
> library that works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would also help them 
> become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
