[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

Shabanali Faghani (JIRA) Thu, 04 Aug 2016 06:24:32 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407756#comment-15407756
 ]


Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

Unfortunately, these tests do not show how accurate Tika currently is. I wish 
you had tested Tika's overall algorithm rather than its sub-components, 
because we want to compare the accuracy of Tika against IUST. Since I used the 
charset in the HTTP header as ground truth, if you decide to test Tika again, 
please use the sub-directory names as ground truth, as you did in your first 
evaluation in this thread, and if you do so, turn Meta detection off. Also, 
when presenting the results in a table, please sort them by sub-directory name 
(just like your first table in this thread).

P.S. The sum of the count column in both tables is less than the corpus size!
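
To make the requested protocol concrete, here is a minimal sketch of such an evaluation, not the harness actually used in this thread. It assumes a corpus laid out as corpusRoot/<charset-name>/<file>, treats each sub-directory name as the ground truth, keeps the results sorted by sub-directory name, and reports the total so it can be checked against the corpus size. The class name `CharsetEvalSketch`, the `Tally` type, and the `Function<byte[], String>` detector hook are hypothetical; the detector under test (Tika with Meta detection off, or IUST) would have to be plugged in there. Charset names are compared case-insensitively only, so aliases are not normalized.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical evaluation sketch: scores a detector against sub-directory names.
final class CharsetEvalSketch {

    /** Per-directory tally: files scored and files whose detected charset matched. */
    static final class Tally {
        int count;
        int correct;
    }

    /**
     * Scores every regular file directly under each sub-directory of corpusRoot.
     * The sub-directory name is the expected charset; the detector is an
     * assumption supplied by the caller (any byte[] -> charset-name function).
     */
    static Map<String, Tally> evaluate(Path corpusRoot,
                                       Function<byte[], String> detector) throws IOException {
        // TreeMap keeps rows sorted by sub-directory name, ignoring case.
        Map<String, Tally> results = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(corpusRoot, Files::isDirectory)) {
            for (Path dir : dirs) {
                String expected = dir.getFileName().toString();
                Tally tally = results.computeIfAbsent(expected, k -> new Tally());
                try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, Files::isRegularFile)) {
                    for (Path file : files) {
                        String detected = detector.apply(Files.readAllBytes(file));
                        tally.count++; // every file is counted, even when detection fails
                        if (expected.equalsIgnoreCase(detected)) {
                            tally.correct++;
                        }
                    }
                }
            }
        }
        return results;
    }

    /** Sum of the count column; this should equal the corpus size. */
    static int totalCount(Map<String, Tally> results) {
        return results.values().stream().mapToInt(t -> t.count).sum();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CharsetEvalSketch <corpusRoot>");
            return;
        }
        // Placeholder detector that always answers UTF-8; replace with the tool under test.
        Function<byte[], String> placeholder = bytes -> "UTF-8";
        Map<String, Tally> results = evaluate(Paths.get(args[0]), placeholder);
        for (Map.Entry<String, Tally> e : results.entrySet()) {
            System.out.printf("%-20s %8d %8d%n",
                    e.getKey(), e.getValue().count, e.getValue().correct);
        }
        System.out.println("total files scored: " + totalCount(results));
    }
}
```

Because a file is counted before its result is compared, a detector that returns null or an unexpected name still contributes to the count column, so the total cannot silently fall below the number of files in the corpus.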

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, iust_encodings.zip, 
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents 
> as well as other natural-text documents. But the accuracy of encoding 
> detection tools, including icu4j, is meaningfully lower on HTML documents 
> than on other text documents. Hence, in our 
> project I developed a library that works pretty well for HTML documents, 
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would also help 
> them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
