[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410111#comment-15410111 ]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

Well, now we can have a fair comparison between Tika and IUST. Note that this comparison covers the half of the cases in which no charset information is available in the Meta tags (or the user does not trust the Meta information at all). For the other half, IUST and Tika are on par, though Tika fails in some cases (see TIKA-2050); IUST failed on just two docs.

In my paper I entirely ignored the charsets in Meta tags and did not involve them in my computations. But if you want to treat the Meta charset as ground truth, you should adjust your results, because as they stand they cannot be interpreted as the real-world behavior of some of these algorithms: the corpus does not properly represent real-world conditions. So, for each algorithm that looks for a charset inside the Meta tags (i.e. 1, 2 and 3 in your list above), you should first turn its Meta detection off, then compute its accuracy (which will be <= 1), divide that accuracy by 2, and finally add 50%, i.e. 0.5, to the result. If you do that, the accuracy of "1. Tika's default detection algorithm" and "3. HTMLEncodingDetector" will fall. But I think the accuracy of "2. The proposed detection algorithm" won't change (considering GB18030 as an accepted detection for GBK). The real-world behavior of the last four algorithms, i.e. 4, 5, 6 and 7, would be just like the results you've attached, because they don't look for a charset in Meta tags even if one exists there.

p.s. In the early steps of my work I tested the existing tools against just two encodings, UTF-8 and Windows-1256. Since JUniversalCharDet failed totally on Windows-1256 and was not perfect on UTF-8, I thought it was a poor release of JCharDet and threw it away at the very first steps of my work...
and later on I didn't test it with other encodings. But now it looks great for Windows-1251, GBK and Shift_JIS. Nevertheless, in detecting UTF-8 it is weaker than what I'd seen before.

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as other natural-text documents. But the accuracy of encoding detector tools, including icu4j, on HTML documents is meaningfully lower than on other text documents. Hence, in our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, it seems that having such a facility in Tika would also help them become more accurate.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
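The adjustment described in the comment (halve the meta-off accuracy, then add the 50% of documents assumed to carry a trustworthy Meta charset) can be sketched as below. This is an illustrative sketch, not code from the attached spreadsheets; the function name and the sample accuracy value are made up for the example:

```python
def real_world_accuracy(accuracy_meta_off: float) -> float:
    """Estimate real-world accuracy for a detector that trusts Meta tags.

    Assumes, per the comment, that roughly half of real-world HTML documents
    carry a correct charset in their Meta tags (and are therefore detected
    correctly when Meta is trusted), while the other half must be detected
    from the raw bytes alone with accuracy `accuracy_meta_off`.
    """
    if not 0.0 <= accuracy_meta_off <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return 0.5 + 0.5 * accuracy_meta_off


# Illustrative value only (not a number from the attached .xlsx files):
print(real_world_accuracy(0.80))  # prints 0.9
```

Under this model an algorithm can never score below 0.5 on a corpus where half the documents have usable Meta tags, which is why turning Meta detection off first is necessary to measure the byte-level detector fairly.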