[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860415#comment-15860415
 ] 

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

Attached: the H column is a naive implementation of the idea I proposed 
earlier. _Starvation_ and _malnutrition_ are quite evident for some TLDs in 
this column, but overall it properly reflects the distribution of the selected 
TLDs in Common Crawl. 


Although it's possible to mitigate the problems of this sampling, I don't 
think that's so important: as I've seen in my evaluations, the accuracy of all 
the detector algorithms converges after just a few percent of each TLD has 
been processed. So, selecting either sampling method (mine or yours) won't 
have a meaningful effect on the results, though it will slightly affect the 
weighted aggregated results (see the + and * group bars in the coarse-grained 
results of the attached lang-wise-eval files).
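To make the sampling idea concrete, here is a minimal sketch of proportional 
per-TLD allocation: each TLD gets a slice of a fixed sample budget matching 
its share of the crawl. The shares in the example are made-up placeholders, 
not real Common Crawl figures; it only illustrates why tiny TLDs can starve.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of proportional per-TLD sampling: allocate a fixed budget of
// documents so the sample mirrors the crawl's TLD distribution.
public class TldSampler {
    static Map<String, Integer> allocate(Map<String, Double> tldShare, int budget) {
        Map<String, Integer> quota = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : tldShare.entrySet()) {
            // A TLD with a tiny share rounds down to (near) zero documents,
            // which is the "starvation" effect mentioned above.
            quota.put(e.getKey(), (int) Math.round(budget * e.getValue()));
        }
        return quota;
    }

    public static void main(String[] args) {
        Map<String, Double> share = new LinkedHashMap<>();
        share.put("com", 0.50);   // placeholder shares, not real figures
        share.put("de", 0.05);
        share.put("ir", 0.002);
        System.out.println(allocate(share, 1000)); // {com=500, de=50, ir=2}
    }
}
```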

bq. Let's put off talk about metaheaders and evaluation until we gather the 
data.

Ok.

bq. I added the codes you added above and a few others. How does this look?

Looks fine to me, at least at this stage.

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as other plain-text documents. But the accuracy of 
> encoding detection tools, including icu4j, on HTML documents is meaningfully 
> lower than on other text documents. Hence, in our project I developed a 
> library that works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, having such a facility in Tika would help them become more 
> accurate as well.
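As a rough illustration of why HTML needs special handling: markup is ASCII 
noise that can skew byte-level statistics, so an HTML-aware detector can 
consult the in-document charset declaration before falling back to byte 
heuristics. The sketch below shows that general idea only (a meta-charset 
lookup plus a crude UTF-8 round-trip check); it is not the actual algorithm 
of icu4j or IUST-HTMLCharDet, and the windows-1252 fallback is an assumption 
for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Greatly simplified HTML-aware charset detection: trust a declared
// <meta charset=...> first, then fall back to a byte-level heuristic.
public class HtmlCharsetSketch {
    private static final Pattern META = Pattern.compile(
            "charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    static String detect(byte[] html) {
        // Markup is ASCII, so decoding as ISO-8859-1 is lossless for tags.
        String ascii = new String(html, StandardCharsets.ISO_8859_1);
        Matcher m = META.matcher(ascii);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1)).name();
        }
        // Fallback: does the byte stream survive a UTF-8 round trip?
        byte[] roundTrip = new String(html, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(html, roundTrip) ? "UTF-8" : "windows-1252";
    }

    public static void main(String[] args) {
        byte[] doc = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(doc)); // UTF-8
    }
}
```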



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
