[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2038:
------------------------------
    Attachment: comparisons_20160803.xlsx

I wrote a markup stripper that ignores content in tags, comments, <style> and <script> elements.

I then compared:
1. Tika's default detection algorithm
2. The proposed detection algorithm
3. HTMLEncodingDetector
4. UniversalEncodingDetector
5. UniversalEncodingDetector (on input that had been stripped)
6. ICU4J
7. ICU4J (on input that had been stripped)

After we do some more evaluation, I propose that we move to this order:
HTMLEncodingDetector
ICU4J with added stripping

The performance of ICU4J improves dramatically if we strip the style/script info, and this is in line with [~faghani] et al.'s finding.

Let me know what you think...

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803.xlsx, iust_encodings.zip,
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents
> as well as of other natural-text documents. But the accuracy of encoding
> detection tools, including icu4j, is meaningfully lower for HTML documents
> than for other text documents. Hence, in our project I developed a library
> that works pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML
> documents, it seems that having such a facility in Tika would also help
> them become more accurate.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
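
For illustration, the kind of markup stripping described in the comment above could look roughly like the following. This is a minimal sketch, not the stripper used for the attached comparisons. The class name `MarkupStripper` and the regex-based approach are assumptions made here. It also assumes the input is in an ASCII-compatible encoding, so that `<`, `>` and tag names are single ASCII bytes. ISO-8859-1 is used only as a lossless byte-to-char mapping, so the bytes of the remaining text reach the encoding detector unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: removes markup from raw HTML bytes
// before they are handed to a charset detector.
final class MarkupStripper {
    // (?s) lets '.' match newlines; (?i) makes tag names case-insensitive.
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern TAGS = Pattern.compile("(?s)<[^>]*>");

    private MarkupStripper() {}

    static byte[] strip(byte[] raw) {
        // ISO-8859-1 maps each byte to exactly one char and back,
        // so non-ASCII bytes in the text survive the round trip.
        String s = new String(raw, StandardCharsets.ISO_8859_1);
        s = COMMENTS.matcher(s).replaceAll(" ");      // drop comments
        s = SCRIPT_STYLE.matcher(s).replaceAll(" ");  // drop <script>/<style> with their content
        s = TAGS.matcher(s).replaceAll(" ");          // drop the remaining tags
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The stripped bytes would then be passed to the detector in place of the raw page. A regex pass like this will mishandle malformed markup (for example an unclosed `<script>`), so a real implementation would likely need a more tolerant scanner.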