[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407523#comment-15407523
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 8/4/16 10:21 AM:
------------------------------------------------------------------

Maybe you’ve already found the answer to this question in my recent comment; 
anyway, for more clarification…

*First corpus:* Since there wasn’t any benchmark in this context, I wrote a 
simple multi-threaded crawler to collect a fairly small one. I used the charset 
information that is available in the HTTP header of almost half of the HTML 
pages as the validity measure. In fact, the crawled pages that had charset 
information in their HTTP header were categorized under the *corpus* directory, 
with that information as the subdirectory name, e.g. GBK, Windows-1251, etc. 
(almost half of all the pages requested by my crawler); the other half were 
simply ignored. Since almost all HTML pages whose HTTP servers provide clients 
with charset information also have charset information in their Meta tags, 
almost all docs in the first corpus carry this information too, though the two 
values are not necessarily the same!
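
For illustration, a minimal sketch of that header-based bucketing step is given 
below, assuming a plain blocking fetch; the class name, method name, and file 
naming are mine for illustration and are not taken from the actual crawler:

{code:java}
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetBucketer {

    /**
     * Fetches a page and, if its HTTP Content-Type header declares a charset,
     * stores the raw bytes under corpus/<CHARSET>/; pages without a declared
     * charset in the header are simply ignored.
     */
    public static void bucketByHeaderCharset(String url, Path corpusDir) throws Exception {
        URLConnection conn = new URL(url).openConnection();
        String contentType = conn.getContentType();           // e.g. "text/html; charset=GBK"
        String charset = extractCharset(contentType);
        if (charset == null) {
            return;                                            // no charset in HTTP header -> skip
        }
        Path dir = corpusDir.resolve(charset.toUpperCase());   // e.g. corpus/GBK
        Files.createDirectories(dir);
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, dir.resolve(fileNameFor(url)));     // keep raw bytes, no transcoding
        }
    }

    private static String extractCharset(String contentType) {
        if (contentType == null) return null;
        for (String part : contentType.split(";")) {
            part = part.trim();
            if (part.toLowerCase().startsWith("charset=")) {
                return part.substring("charset=".length()).replace("\"", "").trim();
            }
        }
        return null;
    }

    private static String fileNameFor(String url) {
        return Integer.toHexString(url.hashCode()) + ".html";
    }
}
{code}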

*Second corpus:* There is no second corpus in the sense you have in mind. That 
is just a collection of 148,297 URLs extracted from the Alexa top 1 million 
sites, using [Top Level Domain 
(TLD)|https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains] names 
as the selection criterion for 8 languages. These URLs are available 
[here|https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data/language-wise]
 (the last 8 files, not the directories). Again, in this evaluation we used the 
charset information in the HTTP header as the validity measure/ground truth, 
and since this information was available for only 85,292 URLs, the rest were 
ignored (a rough sketch of this loop is given after the points below).
Some points…
* The actual number of URLs that had charset information in the HTTP header 
was greater than 85,292, but due to various networking problems some of them 
failed to be fetched
* We didn’t persist these 85,292 pages, because we didn’t need them anymore 
after the test, and their estimated aggregate size was at least ~1.7 GB 
(85,292 * 20 KB ≈ 1,706 MB ≈ 1.7 GB).
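
As a rough sketch of that evaluation loop (the URL file name below is 
illustrative, and icu4j’s CharsetDetector merely stands in for whichever 
detector is being measured; this is my reconstruction of the described 
procedure, not the actual test harness):

{code:java}
import com.ibm.icu.text.CharsetDetector;

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class HeaderGroundTruthEval {

    public static void main(String[] args) throws Exception {
        // One of the 8 language-wise URL files (illustrative name)
        List<String> urls = Files.readAllLines(Paths.get("test-data/language-wise/urls.txt"));
        int evaluated = 0, correct = 0;
        for (String url : urls) {
            try {
                URLConnection conn = new URL(url).openConnection();
                String headerCharset = charsetFrom(conn.getContentType());
                if (headerCharset == null) {
                    continue;                          // no ground truth -> URL is ignored
                }
                byte[] html;
                try (InputStream in = conn.getInputStream()) {
                    html = in.readAllBytes();
                }
                // Detector under test; icu4j here, but any tool could be plugged in
                CharsetDetector detector = new CharsetDetector();
                detector.setText(html);
                String detected = detector.detect().getName();
                evaluated++;
                if (Charset.forName(headerCharset).equals(Charset.forName(detected))) {
                    correct++;                         // aliases (UTF8 vs UTF-8) count as equal
                }
            } catch (Exception e) {
                // networking problem or unknown charset name -> drops out of the evaluation
            }
        }
        System.out.printf("with ground truth and fetched: %d, detected correctly: %d%n",
                evaluated, correct);
    }

    private static String charsetFrom(String contentType) {
        if (contentType == null) return null;
        for (String part : contentType.split(";")) {
            part = part.trim();
            if (part.toLowerCase().startsWith("charset=")) {
                return part.substring("charset=".length()).replace("\"", "").trim();
            }
        }
        return null;
    }
}
{code}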



> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, iust_encodings.zip, 
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as other natural-text documents. But the accuracy of 
> encoding detection tools, including icu4j, on HTML documents is meaningfully 
> lower than on other text documents. Hence, in our project I developed a 
> library that works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would also help them 
> become more accurate.



