[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857926#comment-15857926 ]

Tim Allison edited comment on TIKA-2038 at 2/8/17 12:29 PM:
------------------------------------------------------------

bq. Since it seems that, in this test, the potential charset in the meta headers 
is the only thing available to use as “ground truth”, if we use Tika's 
HtmlEncodingDetector class (with the META_TAG_BUFFER_SIZE field set to 
Integer.MAX_VALUE), then in addition to extracting potential charsets from the 
meta headers, it will implicitly act as an HTML filter.
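
For reference, the quoted suggestion boils down to something like the sketch 
below.  This assumes a mark/reset-capable stream; note that META_TAG_BUFFER_SIZE 
is a private constant in HtmlEncodingDetector, so raising it to Integer.MAX_VALUE 
would mean patching the class rather than configuring it (class and file names 
here are illustrative only).

{code:java}
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;

public class MetaCharsetSniffer {
    public static void main(String[] args) throws Exception {
        // HtmlEncodingDetector only reports a charset it finds in a <meta>
        // tag (or BOM) within its buffer; otherwise it returns null.
        HtmlEncodingDetector detector = new HtmlEncodingDetector();
        try (InputStream is = new BufferedInputStream(
                Files.newInputStream(Paths.get(args[0])))) {
            Charset charset = detector.detect(is, new Metadata());
            System.out.println(charset == null ? "no meta charset found" : charset.name());
        }
    }
}
{code}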

In the above sql/proposal, the mime is what was returned in the actual HTTP 
headers, as recorded by CommonCrawl.  Those values are still somewhat noisy.  
Let's put off discussion of meta headers and evaluation until we've gathered 
the data.
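
As a small illustration of why the header mimes are noisy: Content-Type values 
come back with parameters, mixed casing, and typos, so something along these 
lines (a sketch, not the actual pipeline) is needed before grouping by mime.

{code:java}
import java.util.Locale;

public class MimeNormalizer {
    // Reduce a raw Content-Type header to a bare, lower-cased mime type,
    // e.g. "Text/HTML; charset=ISO-8859-1" -> "text/html".
    static String normalizeMime(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String mime = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return mime.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalizeMime("Text/HTML; charset=ISO-8859-1"));
    }
}
{code}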

In the attached file, I applied a "dominant" language code to each country.  For 
countries with multiple "dominant" languages, I used the country code ("in" -> 
"in").  This is a very rough attempt to get decent coverage of languages.  I 
then calculated how many pages we'd want to collect from each country to get 
roughly 50k per language.
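
One way to read that calculation is the sketch below.  The real mapping and 
counts are in the attached proposedTLDSampling.csv; the TLDs, language codes, 
and the even-split scheme here are placeholders, not the actual numbers.

{code:java}
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class TldQuotaSketch {
    public static void main(String[] args) {
        // Hypothetical country(TLD) -> dominant-language mapping.
        Map<String, String> tldToLang = new LinkedHashMap<>();
        tldToLang.put("de", "de");
        tldToLang.put("at", "de");
        tldToLang.put("ch", "de");
        tldToLang.put("jp", "ja");
        tldToLang.put("in", "in"); // multi-language country -> country code reused

        int targetPerLanguage = 50_000;

        // Count how many TLDs share each language code.
        Map<String, Integer> tldsPerLang = new HashMap<>();
        for (String lang : tldToLang.values()) {
            tldsPerLang.merge(lang, 1, Integer::sum);
        }

        // Split each language's target evenly across its TLDs (one possible scheme).
        for (Map.Entry<String, String> e : tldToLang.entrySet()) {
            int quota = targetPerLanguage / tldsPerLang.get(e.getValue());
            System.out.printf("pull ~%d text/html pages from .%s (%s)%n",
                    quota, e.getKey(), e.getValue());
        }
    }
}
{code}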

I added the codes you added above and a few others.  How does this look?




> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents 
> as well as of other plain-text documents. But the accuracy of encoding 
> detection tools, including icu4j, on HTML documents is meaningfully lower than 
> on other text documents. Hence, in our project I developed a library that works 
> pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, 
> Lucene, Solr, etc., and these projects deal heavily with HTML documents, it 
> seems that having such a facility in Tika would help them become more accurate 
> as well.
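
For context, the icu4j path the description refers to is a purely statistical, 
byte-level detection call along the lines of the sketch below (class name and 
sample input are illustrative, not Tika's actual wiring).

{code:java}
import java.nio.charset.StandardCharsets;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class Icu4jDetectSketch {
    // Statistical charset detection over raw bytes, with no HTML awareness.
    static String guessCharset(byte[] rawBytes) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(rawBytes);
        CharsetMatch match = detector.detect();  // best match, or null
        return match == null ? null : match.getName();
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>h\u00e9llo w\u00f6rld</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(guessCharset(html));
    }
}
{code}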



