[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887801#comment-15887801 ]
Shabanali Faghani edited comment on TIKA-2038 at 3/4/17 6:31 PM:
-----------------------------------------------------------------

Perfect reply, [~talli...@mitre.org]. Thank you!

bq. The current version of the stripper leaves in <meta > headers if they also include "charset". … I included the output of the stripped HTMLMeta detector as a sanity check … (/)

bq. I figure that we'll be modifying the stripper …

We might need the stripper to work like a SAX parser, i.e. the input should be an _InputStream_. This is required if we decide to be conservative about OOM errors, or to avoid wasting resources on big HTML files. I know that writing a perfect _html stream stripper_ with minimal faults (false negatives/positives, exceptions, …) is very hard. As a SAX parser, TagSoup should be able to do this, but there are two problems: _chicken and egg_ and _performance_. The former can be solved by the _ISO-8859-1 encoding-decoding_ trick, but there is no solution for the latter.

For a lightweight SAX-style stripper, I think we can ask [Jonathan Hedley|https://jhy.io/], the author of Jsoup, or someone else on Jsoup's mailing list whether they have ever done something like this or could help us. We may also suggest/introduce IUST (the standalone version) to them. IIRC, in Jsoup 1.6.1-3 (and most likely still today) the charset of a page was assumed to be UTF-8 if the HTTP header didn't contain any charset and none was specified in the input.

bq. … and possibly IUST.

The current version of IUST, i.e. htmlchardet-1.0.1, uses _early-termination_ for neither JCharDet nor ICU4J! So, we would have to write a custom version of IUST to do that. Nevertheless, I think we can ignore this for the first version, because it shouldn't have a meaningful effect on the algorithm. In fact, I think calling the detection methods of JCharDet and ICU4J with an _InputStream_ input will slightly increase efficiency at the cost of a slight decrease in accuracy.

bq.
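For readers unfamiliar with the _ISO-8859-1 encoding-decoding_ trick mentioned above, here is a minimal sketch of the idea (the class name and regex are mine, not IUST or TagSoup code): since ISO-8859-1 maps every byte value to a character one-to-one, the raw bytes can be decoded losslessly, scanned for a declared charset, and then re-decoded with the real one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Iso88591Trick {

    // Loose match for charset declarations such as <meta charset="UTF-8">
    // or <meta http-equiv=... content="text/html; charset=utf-8">.
    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    public static String decode(byte[] html) {
        // ISO-8859-1 decodes any byte sequence without loss or error,
        // which breaks the chicken-and-egg cycle.
        String latin1 = new String(html, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(latin1);
        if (m.find()) {
            try {
                // Re-decode the original bytes with the declared charset.
                return new String(html, Charset.forName(m.group(1)));
            } catch (IllegalArgumentException e) {
                // Bogus or unsupported charset name: fall through.
            }
        }
        // No usable declaration; a statistical detector should take over here.
        return latin1;
    }
}
```

This only illustrates the decoding round-trip; a real stripper would of course do the scan in streaming fashion rather than over a full in-memory string.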
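On the _early-termination_ point, one cheap approximation (my sketch, not IUST code) is to cap how many bytes are handed to the detectors, so a big HTML file is never read or buffered in full:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class DetectorInput {

    // Read at most `cap` bytes from the stream. The (possibly truncated)
    // buffer can then be passed to JCharDet/ICU4J instead of the whole
    // document, trading a little accuracy for less I/O and memory.
    public static byte[] head(InputStream in, int cap) throws IOException {
        byte[] buf = new byte[cap];
        int off = 0;
        int n;
        while (off < cap && (n = in.read(buf, off, cap - off)) != -1) {
            off += n;
        }
        return off == cap ? buf : Arrays.copyOf(buf, off);
    }
}
```

The cap value (e.g. a few KB) would have to be tuned against the lang-wise evaluation corpus to confirm the accuracy loss really is negligible.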
I didn't use IUST because this was a preliminary run, and I wasn't sure which version I should use. The one on github or the proposed modification above or both? Let me know which code you'd like me to run.

The _modified IUST_ isn't complete yet. To complete it, we must prepare a thorough list of languages for which stripping shouldn't be done. These languages/TLDs are determined by comparing the results of IUST with and without stripping. So, you should run both _htmlchardet-1.0.1.jar_ (IUST with stripping) with _lookInMeta=false_ and the class _IUSTWithoutMarkupElimination_ (IUST without stripping) from the [lang-wise-eval source code|https://issues.apache.org/jira/secure/attachment/12848364/lang-wise-eval_source_code.zip]. The accuracy of the _modified IUST_ (the pseudo code above) can then be computed algorithmically by selecting the better of the two for each language/TLD.

bq. I want to focus on accuracy first. We still have to settle on an eval method. But, yes, I do want to look at this.

(/)

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv, tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html_plus_H_column.xlsx, tld_text_html.xlsx
>
> Currently, Tika uses ICU4J for detecting the charset encoding of HTML documents as well as of other natural-text documents. But the accuracy of encoding detection tools, including ICU4J, when dealing with HTML documents is meaningfully lower than with other text documents. Hence, in our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and those projects deal heavily with HTML documents, it seems that having such a facility in Tika would also help them become more accurate.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
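The per-language/TLD selection described in the comment above ("selecting the better of the two for each language/TLD") can be sketched as follows; the class and method names are mine, invented for illustration, and the maps are assumed to hold per-TLD accuracy scores from the two evaluation runs:

```java
import java.util.HashMap;
import java.util.Map;

public class LangWiseBest {

    // Estimate the accuracy of the hypothetical "modified IUST" by taking,
    // for each language/TLD, the better score of the with-stripping run
    // and the without-stripping run.
    public static Map<String, Double> bestOfTwo(Map<String, Double> withStripping,
                                                Map<String, Double> withoutStripping) {
        Map<String, Double> best = new HashMap<>();
        for (Map.Entry<String, Double> e : withStripping.entrySet()) {
            double other = withoutStripping.getOrDefault(e.getKey(), 0.0);
            best.put(e.getKey(), Math.max(e.getValue(), other));
        }
        return best;
    }
}
```

The TLDs where the without-stripping score wins are exactly the ones that would go on the "don't strip" list for the modified algorithm.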