[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418855#comment-15418855 ]
Tim Allison edited comment on TIKA-2038 at 8/12/16 6:40 PM:
------------------------------------------------------------

bq. But since I haven’t access to a broadband Internet connection

Oh, ok. I've been thinking about this a bit more. I think I'd like to sample urls from Common Crawl based on country codes in the urls. I can take care of this in a few weeks.

bq. Please send me your markup stripper so I can use it in my code to evaluate your both stripper and proposed algorithm.

I'll post that today...if I have time.

bq. BTW, what is tika-eval code?

Code [here|https://github.com/tballison/tika/tree/TIKA-1302] still needs some work, but it evaluates the output of two runs of Tika and reports on differences in number of exceptions, mime detection diffs, content diff, etc. I was hoping to have time to get this ready for 1.14, but 1.15 is looking more likely.

You can see an example of the output of the comparison code [here|https://github.com/tballison/share/blob/master/poi_comparisons/reports_poi_3_15-beta3_reports.zip?raw=true].

was (Author: talli...@mitre.org):

bq. But since I haven’t access to a broadband Internet connection

Oh, ok. I've been thinking about this a bit more. I think I'd like to sample urls from Common Crawl based on country codes in the urls. I can take care of this in a few weeks.

bq. Please send me your markup stripper so I can use it in my code to evaluate your both stripper and proposed algorithm.

I'll post that today.

bq. BTW, what is tika-eval code?

Code [here|https://github.com/tballison/tika/tree/TIKA-1302] still needs some work, but it evaluates the output of two runs of Tika and reports on differences in number of exceptions, mime detection diffs, content diff, etc. I was hoping to have time to get this ready for 1.14, but 1.15 is looking more likely.
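
The sampling idea in the comment (picking Common Crawl urls by the country code in the url) could look roughly like the sketch below. This is only an illustration, not code from the issue or from Tika: the function names `country_code` and `sample_by_country` are hypothetical, the input is assumed to be a plain list of url strings already extracted from a crawl index, and any two-letter ASCII top-level domain is assumed to be a country code.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def country_code(url):
    """Return the two-letter final host label (e.g. 'ir'), or None."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # Malformed urls (e.g. a broken IPv6 literal) are skipped.
        return None
    if not host or "." not in host:
        return None
    tld = host.rsplit(".", 1)[-1]
    # Assumption: every two-letter ASCII alphabetic TLD is a country code.
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return tld
    return None


def sample_by_country(urls, per_country, seed=0):
    """Group urls by country code and draw up to per_country from each group."""
    groups = defaultdict(list)
    for url in urls:
        cc = country_code(url)
        if cc is not None:
            groups[cc].append(url)
    rng = random.Random(seed)  # fixed seed so a sample can be reproduced
    return {
        cc: rng.sample(group, min(per_country, len(group)))
        for cc, group in sorted(groups.items())
    }
```

Calling `sample_by_country(urls, 100)` would give at most 100 urls per country code, so that encodings common in one country are not drowned out by the overall size of another country's share of the crawl. Generic domains such as `.com` are simply dropped in this sketch.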

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents
> as well as of other inherently textual documents. But the accuracy of encoding
> detection tools, including icu4j, is noticeably lower for HTML documents than
> it is for other text documents. Hence, in our project I developed a library
> that works pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, and Solr, and these projects deal heavily with HTML documents,
> having such a facility in Tika should also help them become more accurate.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)