[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698268#comment-16698268 ]

Shabanali Faghani edited comment on TIKA-2038 at 12/2/18 10:32 PM:
-------------------------------------------------------------------

[~HansBrende], thank you for your interest in IUST and for your great analysis.

Given your work here and on TIKA-2771, as well as 
[CommonCrawl3|https://wiki.apache.org/tika/CommonCrawl3] and TIKA-2750 by 
[~talli...@apache.org], it looks like it's time to resume this thread.

jchardet's algorithm is just as you've described. To make IUST more efficient 
and standalone, with no dependencies, I also made a small attempt to separate 
out jchardet's UTF-8 detector after my last comment here. If I remember 
correctly, it keeps a small list of scores, one per detector, and at the end of 
the detection process it scans this list to find the best match. So I concluded 
that splitting off its UTF-8 detector isn't feasible: sometimes jchardet 
detects the charset of a page as something other than UTF-8 because another 
entry in the list has a higher probability than UTF-8's. If that's true, then 
in the absence of the other detectors jchardet will detect those cases as 
UTF-8, which means its false positives for UTF-8 will increase (and its true 
negatives will decrease)... I don't know, maybe dramatically!
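
To make the concern concrete, here is a minimal sketch of the arg-max vote I'm 
describing. The names and numbers are hypothetical, not jchardet's actual API; 
it only illustrates why removing the competing detectors turns a correctly 
outvoted UTF-8 score into a UTF-8 false positive:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public class CharsetVote {

    // Pick the charset whose detector reports the highest score.
    static String bestMatch(Map<String, Double> scores) {
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("UNKNOWN");
    }

    public static void main(String[] args) {
        // Made-up scores for one hypothetical non-UTF-8 page.
        Map<String, Double> all = new LinkedHashMap<>();
        all.put("UTF-8", 0.55);
        all.put("windows-1256", 0.70); // outranks UTF-8 in the full list

        System.out.println(bestMatch(all)); // windows-1256 (correct)

        // With the other detectors stripped out, UTF-8 wins by default:
        // a false positive for UTF-8 on this page.
        System.out.println(bestMatch(Map.of("UTF-8", 0.55))); // UTF-8
    }
}
{code}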

I'll measure f8's false-positive and true-negative rates and compare them with 
jchardet's. I hope I've been wrong.
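
The measurement I have in mind is roughly the following sketch, where 
{{detectsUtf8}} is a stand-in for whichever detector (f8, or a stripped-down 
jchardet) is under test:

{code:java}
import java.util.List;
import java.util.function.Predicate;

public class Utf8Eval {

    // A page's raw bytes plus its ground-truth label.
    record Sample(byte[] bytes, boolean isUtf8) {}

    // False-positive and true-negative rates over the non-UTF-8 pages.
    static void evaluate(List<Sample> corpus, Predicate<byte[]> detectsUtf8) {
        int fp = 0, tn = 0;
        for (Sample s : corpus) {
            if (s.isUtf8()) continue;            // only non-UTF-8 pages count
            if (detectsUtf8.test(s.bytes())) fp++; else tn++;
        }
        int negatives = fp + tn;
        System.out.printf("FP rate: %.3f, TN rate: %.3f%n",
                (double) fp / negatives, (double) tn / negatives);
    }
}
{code}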

I'll take care of this next week... right now I'm on holiday and typing on my 
mobile phone!



> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, 
> tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as of other plain-text documents. But the accuracy of 
> encoding detection tools, including icu4j, on HTML documents is meaningfully 
> lower than on other text documents. Hence, in our project I developed a 
> library that works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, and Solr, and these projects deal heavily with HTML documents, 
> it seems that having such a facility in Tika would help them become more 
> accurate as well.


