[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

Shabanali Faghani (JIRA) Sun, 31 Jul 2016 23:33:29 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401595#comment-15401595
 ]


Shabanali Faghani edited comment on TIKA-2038 at 8/1/16 6:33 AM:
-----------------------------------------------------------------

At first glance I was astonished by these results, because they are far better 
than what I saw when I tested Tika earlier. Then I remembered that almost all 
of the test files in my corpus have charset information in their Meta tags, 
and according to the order of your algorithm, as you stated in the first 
comment on this issue, it [looks for a charset in Meta 
tags|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L103]
 before anything else. [My algorithm does the same 
thing|https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/src/main/java/ir/ac/iust/htmlchardet/HTMLCharsetDetector.java#L49],
 but there it is optional, and for the evaluations in my paper (both 
encoding-wise and language-wise) I called the {{detect(byte[] rawHtmlByteSequence, 
boolean... lookInMeta)}} method with {{false}} for the {{lookInMeta}} 
argument, because …

1) it seems that no charset information is available (neither in the HTTP 
header at crawl time nor in Meta tags in offline mode) for almost half of 
all HTML documents, see *#primitive URLs* and *#sites with valid charset in 
HTTP header* in the [Language-Wise 
Evaluation|https://github.com/shabanali-faghani/IUST-HTMLCharDet#language-wise-evaluation]
 table, and …
2) as you know, for the other half, where charset information is available, 
there is no 100% guarantee that this information is valid.
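
As a rough illustration of the optional Meta lookup described above, here is a minimal sketch. It is not the code of IUST-HTMLCharDet or Tika: the class name {{MetaOptionalDetector}}, the regular expression, the 8192-byte scan limit, the "flag omitted means false" default and the {{detectFromContent}} placeholder (a strict UTF-8 check instead of a real statistical detector) are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual IUST-HTMLCharDet or Tika implementation.
final class MetaOptionalDetector {

    // Matches charset=xyz inside a <meta ...> tag, covering both the
    // content="...; charset=xyz" form and the HTML5 charset="xyz" form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    static String detect(byte[] rawHtmlByteSequence, boolean... lookInMeta) {
        // Assumption: an omitted flag means "do not look in Meta tags".
        boolean useMeta = lookInMeta.length > 0 && lookInMeta[0];
        if (useMeta) {
            // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable.
            int len = Math.min(rawHtmlByteSequence.length, 8192);
            String head = new String(rawHtmlByteSequence, 0, len, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(head);
            if (m.find()) {
                return m.group(1); // the declared charset, which may be wrong
            }
        }
        return detectFromContent(rawHtmlByteSequence);
    }

    // Placeholder for a real content-based detector: reports UTF-8 if the bytes
    // decode strictly as UTF-8, otherwise falls back to windows-1252.
    static String detectFromContent(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "windows-1252";
        }
    }
}
```

With a document whose Meta tag declares a charset that differs from its real encoding, {{detect(bytes, true)}} returns the declared value while {{detect(bytes, false)}} depends only on the bytes, which is why the flag matters for an evaluation.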

So, for a fair evaluation/comparison, the potential charsets in Meta tags 
should not be involved in the detection process. Hence, when computing the 
accuracy of Tika-EncodingDetector, the first step of your algorithm should be 
skipped. This can be done either …
* by removing ... (for each document in the corpus)
** the value of the {{content}} attribute of a {{meta}} tag that contains 
{{charset=xyz}}, see 
[this|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L50],
 and
** the value of the {{charset}} attribute of a {{meta}} tag (HTML5), see 
[this|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L52]
* by directly calling the 2nd and 3rd steps of your algorithm (not reliable, 
because there may be some intermediate processing)
* or simply by depending on the Tika source code and commenting out some of its code!
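
The first option above (removing the charset declarations from each document) could be sketched as follows. This is only an illustration under stated assumptions: the class {{MetaCharsetStripper}} and both regular expressions are hypothetical and are not taken from Tika. The bytes are decoded as ISO-8859-1 and encoded back the same way, so every byte outside the removed declarations is preserved exactly. The sketch only works for ASCII-compatible encodings; in a UTF-16 document the markup would not match.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical pre-processing step for an evaluation corpus; not part of Tika.
final class MetaCharsetStripper {

    // "; charset=xyz" inside a content attribute value,
    // e.g. content="text/html; charset=utf-8"
    private static final Pattern CONTENT_CHARSET = Pattern.compile(
            "(<meta\\b[^>]*?\\bcontent\\s*=\\s*[\"'][^\"'>]*?);?\\s*charset\\s*=\\s*[A-Za-z0-9_.:-]+",
            Pattern.CASE_INSENSITIVE);

    // HTML5 form: <meta charset="utf-8">
    private static final Pattern CHARSET_ATTR = Pattern.compile(
            "(<meta\\b[^>]*?)\\s+charset\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    static byte[] strip(byte[] rawHtml) {
        // ISO-8859-1 round-trips all 256 byte values, so non-ASCII content is untouched.
        String html = new String(rawHtml, StandardCharsets.ISO_8859_1);
        // Order matters: handle the content="..." form first, then the bare attribute.
        html = CONTENT_CHARSET.matcher(html).replaceAll("$1");
        html = CHARSET_ATTR.matcher(html).replaceAll("$1");
        return html.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The stripped byte array can then be passed to whichever detector is being evaluated, so that only its content-based steps contribute to the measured accuracy.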


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents 
> as well as of other inherently textual documents. But the accuracy of encoding 
> detection tools, including icu4j, is meaningfully lower on HTML documents 
> than on other text documents. Hence, in our 
> project I developed a library that works pretty well for HTML documents, 
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with 
> HTML documents, it seems that having such a facility in Tika would also 
> help them become more accurate.



