[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399864#comment-15399864
 ] 

Tim Allison commented on TIKA-2038:
-----------------------------------

This would be a major change to Tika, and it could have large consequences.  
I'd like to do more testing and give our community a bit more time to think 
about this before modifying our current methods. 

I have three major concerns at this point:

1) The track record/longevity of the proposed added dependency.  Your code is 
fairly new, and just 6 days ago you fixed a bug that had the detector returning 
"UTF-8" for everything...if I understand correctly.

2) The hit to performance if we have to parse html docs twice.  Have you 
evaluated that?

3) As of yet, we have no comparison between the proposed new method and Tika's 
legacy method.  The evaluation in your paper was a great start (and far better 
than what normally happens when people make decisions about encoding 
detection), but it doesn't prove that we'd get any accuracy gains in Tika.

bq.  I've been wanting to add stripping of html markup because I also found 
that that confuses icu4j.

I was wrong.  After spending some time recently with our ICU4J code, I was 
reminded that it already tries to strip out html markup.  What it fails to 
strip out are the contents of <script/> and <style/> elements.  Perhaps we 
could add some code to do that?
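
To make the suggestion concrete, here is a rough sketch (not Tika's actual code) of removing the contents of <script/> and <style/> elements before handing the markup to a charset detector such as ICU4J's CharsetDetector. The class and pattern names are illustrative, and a regex-based approach is only an approximation of real HTML parsing:

```java
import java.util.regex.Pattern;

public class ScriptStyleStripper {

    // DOTALL so element bodies can span lines; CASE_INSENSITIVE for
    // <SCRIPT>, <Style>, etc.  The backreference \1 keeps <script>...</script>
    // and <style>...</style> properly paired.
    private static final Pattern SCRIPT_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    public static String strip(String html) {
        // Drop each matched element wholesale; the rest of the markup
        // (including any <meta charset=...>) is left untouched, so the
        // detector still sees the declared-encoding hints.
        return SCRIPT_STYLE.matcher(html).replaceAll(" ");
    }
}
```

The stripped string (or its bytes) would then be fed to the detector as usual; the idea is simply that script bodies and inline CSS contribute byte statistics that can mislead the n-gram models.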

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as other plain-text documents. But the accuracy of encoding 
> detector tools, including icu4j, on HTML documents is meaningfully lower than 
> on other text documents. Hence, in our project I developed a library that 
> works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, having such a facility in Tika would help them become more 
> accurate as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)