[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887801#comment-15887801
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 3/4/17 6:31 PM:
-----------------------------------------------------------------

Perfect reply, [~talli...@mitre.org]. Thank you!
 
bq. The current version of the stripper leaves in <meta > headers if they also 
include "charset". … I included the output of the stripped HTMLMeta detector as 
a sanity check … (/)
 
bq. I figure that we'll be modifying the stripper …
 
We might need the stripper to work like a SAX parser, i.e. the input should be an 
_InputStream_. This is required if we decide to be very conservative about OOM 
errors or to avoid wasting resources on big HTML files. I know that writing a 
perfect _HTML stream stripper_ with minimal faults 
(false negatives/positives, exceptions, …) is very hard. As a SAX parser, TagSoup 
should be able to do this, but there are two problems: _chicken and 
egg_ and _performance_. The former can be solved by the _ISO-8859-1 
encoding-decoding_ trick, but there is no solution for the latter.
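
To make the trick concrete, here is a minimal Java sketch (the _Stripper_ 
interface is a placeholder for TagSoup or whatever stripper we end up with, and 
a real streaming variant would decode incrementally instead of buffering the 
whole document):

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {

    /** Placeholder for a char-based markup stripper (TagSoup, regex, ...). */
    public interface Stripper {
        String strip(String html);
    }

    /**
     * ISO-8859-1 maps every byte 0x00-0xFF to exactly one char, so decoding
     * and re-encoding with it round-trips arbitrary bytes losslessly. That
     * lets a char-based stripper run over the raw bytes *before* the real
     * charset is known, which sidesteps the chicken-and-egg problem.
     */
    public static byte[] stripMarkup(InputStream in, Stripper stripper) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            raw.write(buf, 0, n);
        }
        // Decode 1:1 into chars; no byte is lost or altered.
        String pseudoHtml = new String(raw.toByteArray(), StandardCharsets.ISO_8859_1);
        String stripped = stripper.strip(pseudoHtml);
        // Encode back 1:1; the surviving bytes are identical to the input bytes
        // and can be handed to a byte-based detector (ICU4j, JCharDet).
        return stripped.getBytes(StandardCharsets.ISO_8859_1);
    }
}
{code}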

For a lightweight SAX-style stripper, I think we can ask [Jonathan Hedley|https://jhy.io/], 
the author of Jsoup, or someone else on Jsoup’s mailing list 
whether they have ever done something like this or could help us. We may 
also suggest/introduce IUST (the standalone version) to them. IIRC, in Jsoup 
1.6.1-3 (and most likely still now) the charset of a page was assumed to be 
UTF-8 if the HTTP header didn’t contain any charset and no charset was 
specified in the input.
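
For reference, Jsoup’s stream-based entry point looks like this; a tiny sketch 
(base URI illustrative):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupCharsetDefault {
    // With a null charset name, Jsoup sniffs the BOM / <meta charset> itself
    // and, when nothing is found, falls back to UTF-8 -- the default described
    // above.
    static Document parse(InputStream in) throws IOException {
        return Jsoup.parse(in, null, "http://example.com/");
    }
}
{code}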
 
bq. … and possibly IUST.
 
The current version of IUST, i.e. htmlchardet-1.0.1, uses _early termination_ 
for neither JCharDet nor ICU4j! So we would have to write a custom version of 
IUST to do so. Nevertheless, I think we can ignore this for the first version, 
because I don’t think it has a meaningful effect on the algorithm. In fact, I 
think calling the detection methods of JCharDet and ICU4j with an InputStream 
input will increase efficiency a bit at the cost of a small decrease in accuracy.
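
For illustration, here is a rough sketch of what early termination with 
JCharDet’s streaming API could look like (buffer size and structure are my 
assumptions, not htmlchardet code):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

public class EarlyTerminationSketch {

    /**
     * nsDetector.DoIt() returns true as soon as the detector is confident,
     * so we can stop reading the stream instead of feeding it the whole
     * document. Returns null when no charset was detected with confidence.
     */
    static String detect(InputStream in) throws IOException {
        final String[] found = new String[1];
        nsDetector det = new nsDetector();
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                found[0] = charset;
            }
        });
        byte[] buf = new byte[4096];
        boolean done = false;
        int n;
        while (!done && (n = in.read(buf)) != -1) {
            done = det.DoIt(buf, n, false);  // true => confident, terminate early
        }
        det.DataEnd();  // fires Notify() if a charset was decided
        return found[0];
    }
}
{code}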
 
bq. I didn't use IUST because this was a preliminary run, and I wasn't sure 
which version I should use. The one on github or the proposed modification 
above or both? Let me know which code you'd like me to run.
 
The _modified IUST_ isn’t yet complete. To complete it, we must prepare a 
thorough list of languages for which the stripping shouldn’t be done. These 
languages/TLDs are determined by comparing the results of IUST with and 
without stripping. So, you should run both _htmlchardet-1.0.1.jar_ (IUST with 
stripping) with _lookInMeta=false_ and the class _IUSTWithoutMarkupElimination_ 
(IUST without stripping) from the [lang-wise-eval source code|https://issues.apache.org/jira/secure/attachment/12848364/lang-wise-eval_source_code.zip].
 The accuracy of the _modified IUST_ (the pseudo-code above) can then be computed 
algorithmically by selecting the better of the two for each language/TLD, as 
sketched below.
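
To illustrate that selection step, a hypothetical sketch (the maps would hold 
per-TLD accuracies taken from the two evaluation runs):

{code:java}
import java.util.Map;
import java.util.TreeMap;

public class BestPerTld {

    /**
     * The "modified IUST" would pick, per TLD, whichever strategy (with or
     * without stripping) is more accurate, so its expected accuracy is the
     * per-TLD maximum of the two runs. Accuracies are fractions in [0, 1].
     */
    static Map<String, Double> select(Map<String, Double> withStripping,
                                      Map<String, Double> withoutStripping) {
        Map<String, Double> best = new TreeMap<>();
        for (Map.Entry<String, Double> e : withStripping.entrySet()) {
            double other = withoutStripping.getOrDefault(e.getKey(), 0.0);
            best.put(e.getKey(), Math.max(e.getValue(), other));
        }
        return best;
    }
}
{code}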
 
bq. I want to focus on accuracy first. We still have to settle on an eval 
method. But, yes, I do want to look at this. (/)



> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html_plus_H_column.xlsx, 
> tld_text_html.xlsx
>
>
> Currently, Tika uses ICU4J for detecting the charset encoding of HTML documents 
> as well as other plain-text documents. But the accuracy of encoding detection 
> tools, including ICU4J, on HTML documents is meaningfully lower than on other 
> text documents. Hence, in our project I developed a library that works pretty 
> well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, and Solr, and these projects deal heavily with HTML documents, 
> having such a facility in Tika would help them become more accurate as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
