[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401715#comment-15401715 ]
Shabanali Faghani edited comment on TIKA-2038 at 8/1/16 8:45 AM:
-----------------------------------------------------------------

OK, so to give more details about my library to this community, and in response to your concerns, I would like to say:

1) You are right, my repo on GitHub is fairly new (less than 1 year old), but its algorithm is not new. I developed this library 4 years ago for use in a large-scale project… and it has worked well from that time until now. At peak it was under a load of ~1.2 billion pages. The bug that I fixed last week was just a tiny mistake introduced while refactoring the code before the first release.

2) Since accuracy was much more important than performance for us, I haven't done a thorough performance test. Nevertheless, below are the results of a small test run on my laptop (Intel Core i3, Java 6, Xmx: default (don't care)):

||Subdirectory||#docs||Total Size (KB)||Average Size (KB)||Detection Time (millisecond)||Average Time (millisecond)||
|UTF-8|657|32,216|49|26658|40|
|Windows-1251|314|30,941|99|4423|14|
|GBK|419|43,374|104|20317|48|
|Windows-1256|645|66,592|103|9451|14|
|Shift_JIS|640|25,973|41|7617|11|

Let's take a slightly closer look at these results. Due to the logic of my algorithm, for the first row of this table, i.e. UTF-8, only Mozilla JCharDet was used (no JSoup and no ICU4J). But as you can see from this table, the required time is greater than in three of the four other cases, for which the documents were parsed with JSoup and both JCharDet and ICU4J were involved in the detection process.
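As a sanity check on the table, the Average Time column is simply Detection Time divided by the document count. The snippet below (not part of the library; a small sketch that also folds in the ~87% UTF-8 share discussed later in this comment, with the non-UTF-8 remainder assumed, as an assumption of this sketch, to be spread evenly over the other four rows) reproduces those averages:

```java
public class TimingCheck {
    // Integer per-document average: total detection time / number of documents
    static int perDocMs(int docs, int totalMs) {
        return totalMs / docs;
    }

    // Weighted estimate over the table rows; utf8Share is the fraction of the
    // web that is UTF-8 (~0.87 per w3techs). The remainder is assumed (an
    // assumption of this sketch) to be split evenly over the non-UTF-8 rows.
    static double weightedMs(double utf8Share) {
        double utf8 = 26658.0 / 657;
        double rest = (4423.0 / 314 + 20317.0 / 419 + 9451.0 / 645 + 7617.0 / 640) / 4;
        return utf8Share * utf8 + (1 - utf8Share) * rest;
    }

    public static void main(String[] args) {
        System.out.println(perDocMs(657, 26658));  // UTF-8 row      -> 40
        System.out.println(perDocMs(419, 20317));  // GBK row        -> 48
        System.out.printf("%.1f%n", weightedMs(0.87));
    }
}
```

The weighted figure lands close to the UTF-8-only average, which is the point made below: the estimate is dominated by the JCharDet path.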
It means that when the encoding of a page is UTF-8, the time required for a positive response from Mozilla JCharDet is often greater than the total time required to…
* get a negative response from Mozilla JCharDet
* decode the input byte array using "ISO-8859-1"
* parse that document and create a DOM tree
* extract text from the DOM tree
* encode the extracted text using "ISO-8859-1"
* and detect its encoding using ICU4J
… when the encoding of a page is not UTF-8!! In brief, 40 > 14, 11, … in the above table.

Now let's have a look at the [distribution of character encodings for websites|https://w3techs.com/technologies/history_overview/character_encoding]. Since ~87% of all websites use UTF-8, if we compute a weighted average time for detecting the encoding of an arbitrary HTML document, I think we would get a similar estimate for both IUST-HTMLCharDet and Tika-EncodingDetector, because this estimate is strongly biased by Mozilla JCharDet, and as we know this tool is used in both algorithms in a similar way. Nevertheless, for performance optimization I will run some tests on…
* using a regex instead of navigating the DOM tree to seek charsets in Meta tags
* stripping HTML markup, scripts and embedded CSS directly, instead of using an HTML parser

3) For computing the accuracy of Tika's legacy method, I've left a comment below your current evaluation results. As I explained there, the results of your current evaluation can't be compared with my evaluation.

bq. Perhaps we could add some code to do that?

Of course, but from experience, when I use open source libraries in my projects, due to versioning and updating considerations I don't move my code into them unless there is no other suitable option. But pulling a part of their code into my projects is *another story*!
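The non-UTF-8 steps listed above can be sketched with the JDK alone. This is not the library's actual code: the regex-based stripping (one of the optimizations mentioned above) stands in for the JSoup parse and DOM text extraction, and the JCharDet/ICU4J calls are left as comments:

```java
import java.nio.charset.StandardCharsets;

public class FallbackSketch {
    // Regex-based markup stripping, standing in for JSoup parsing
    // and DOM-tree text extraction.
    static String extractText(String html) {
        return html
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // drop scripts and embedded CSS
            .replaceAll("(?s)<[^>]+>", " ")                         // drop remaining tags
            .replaceAll("\\s+", " ")
            .trim();
    }

    // The non-UTF-8 path: decode losslessly as ISO-8859-1 (every byte maps
    // to exactly one char), strip markup, then re-encode with ISO-8859-1 so
    // the original bytes of the visible text survive for the detector.
    static byte[] visibleTextBytes(byte[] raw) {
        // (in the real flow this runs only after JCharDet returns a
        // negative response for UTF-8)
        String doc = new String(raw, StandardCharsets.ISO_8859_1);
        String text = extractText(doc);
        return text.getBytes(StandardCharsets.ISO_8859_1);
        // these bytes would then be handed to ICU4J's CharsetDetector
    }
}
```

The ISO-8859-1 round trip is what makes the trick work: decoding and re-encoding with it is byte-preserving, so stripping markup in string space does not corrupt the multi-byte sequences the statistical detector needs.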
:)

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: tika_1_14-SNAPSHOT_encoding_detector.zip
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as other plain text documents.
> But the accuracy of encoding detector tools, including icu4j, in dealing with HTML documents is meaningfully lower than with other text documents. Hence, in our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, it seems that having such a facility in Tika would also help them become more accurate.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)