[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026 ]

Doug Cook commented on NUTCH-25:
--------------------------------

OK, I've got more data, and a proposed solution.

I created a test set with a number of problem cases and their correct answers. 
In digging through the "mistakes" the encoding detector made, I found a few 
different root causes. Most of these fell into the following three categories:

1) Mixed encodings in the document itself (a "mistake" on the part of the 
author, though there may still be a "right" encoding guess that covers most of 
the document).
     Ex: http://www.franz-keller.de/8860.html (mostly UTF-8 with one 
ISO-8859-1 "copyright" character in the footer)
     Ex: http://www.vinography.com/archives/2006/05/the_rejudgement_of_paris_resul.html 
(mostly UTF-8 with a couple of ISO-8859-1 arrows in the header)

2) CSS and/or JavaScript (and maybe HTML tags) throwing off the detector.
     Ex: http://www.systembolaget.se/Uppslagsbok/Kartbok/Italien/NorraItalien/NorraItalien.htm
     Ex: http://www.buscamaniban.com/fr/patrimoine/coeur-armagnac.php

3) The detector having problems with short documents or ones that contain few 
multibyte characters (since the detection is statistical, the less data it has, 
the more mistakes it makes).
     Ex: http://forum.winereport.com/ita/index.php?showtopic=1924&st=90 
(the detector thinks this is big5 at 100% confidence)

Solutions: 

I've attached a class, EncodingDetector, that seems to solve most of these 
problems. It also moves the detection code out of the Content class.

Problem 2) The detector has a simple filter for HTML tags, but the 
CharsetDetector documentation strongly recommends writing one's own. So I did 
this; see the stripGarbage function in the EncodingDetector class. It's quick & 
dirty, and clears out much of the garbage that causes detection problems. I'm 
sure it's not perfect, but it seems to do the job.
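
To give a sense of what that filter does, here is a rough sketch of the idea 
(not the code from the attached class; the regexes and the byte-preserving 
ISO-8859-1 round trip are just one plausible way to do it):

import java.nio.charset.StandardCharsets;

public class GarbageStripper {

  // Strip script/style blocks, comments, and remaining tag markup so that
  // (mostly) natural-language text reaches the statistical detector.
  public static byte[] stripGarbage(byte[] raw) {
    // ISO-8859-1 maps every byte to a char one-to-one, so the round trip
    // preserves the bytes we keep, whatever the real encoding is.
    String s = new String(raw, StandardCharsets.ISO_8859_1);
    s = s.replaceAll("(?is)<script[^>]*>.*?</script>", " ");
    s = s.replaceAll("(?is)<style[^>]*>.*?</style>", " ");
    s = s.replaceAll("(?s)<!--.*?-->", " ");
    s = s.replaceAll("(?s)<[^>]*>", " ");   // any remaining tags
    return s.getBytes(StandardCharsets.ISO_8859_1);
  }
}

The point is just that the bytes handed to CharsetDetector are mostly prose 
rather than markup, scripts, and stylesheets.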

Problem 3) Detection is inherently imprecise; there will always be errors. But 
I've tried to make it easier to work around them or to build a better heuristic 
"guesser" based upon all the clues we have (not just the text, but the headers 
& metatags). One key is to use detectAll and look at all the possible encodings 
rather than just the first one returned. For example, with the big5 problem 
noted above, the detector got [EMAIL PROTECTED] and also [EMAIL PROTECTED]. 
(According to the authors, when multiple detectors tie, the matches are returned 
in alphabetical order!) EncodingDetector allows different confidence thresholds 
for different encodings (there's no reason to assume they all work equally well). 
So one simple workaround is to set the threshold for big5 to 101 (meaning use 
only when there are no other alternatives), and now EncodingDetector returns 
[EMAIL PROTECTED] for this doc; I don't have much big5 in my collection. 
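
In rough terms, the thresholding idea looks something like this (a simplified 
sketch against ICU's CharsetDetector, not the attached EncodingDetector API; 
the method names, default threshold, and caller-supplied fallback are 
placeholders):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.util.HashMap;
import java.util.Map;

public class ThresholdedDetector {

  // Per-charset minimum confidences; anything not listed uses the default.
  private final Map<String, Integer> minConfidence = new HashMap<String, Integer>();
  private static final int DEFAULT_MIN_CONFIDENCE = 50;  // placeholder default

  // e.g. setMinConfidence("Big5", 101) effectively rules out big5,
  // since ICU confidences top out at 100.
  public void setMinConfidence(String charset, int threshold) {
    minConfidence.put(charset, threshold);
  }

  // Returns the highest-confidence candidate that clears its threshold, or the
  // caller's fallback (the real class would presumably fall back to the
  // header/metatag clues instead).
  public String guess(byte[] content, String fallback) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(content);

    String best = fallback;
    int bestConfidence = -1;
    for (CharsetMatch match : detector.detectAll()) {  // all candidates, not just the first
      Integer threshold = minConfidence.get(match.getName());
      int min = (threshold != null) ? threshold : DEFAULT_MIN_CONFIDENCE;
      if (match.getConfidence() >= min && match.getConfidence() > bestConfidence) {
        best = match.getName();
        bestConfidence = match.getConfidence();
      }
    }
    return best;
  }
}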

Long-term there are more sophisticated solutions, but I think the high-level 
architecture is right, at any rate: get all the data from CharsetDetector, get 
all the other "clues" (HTTP header, HTML metatags), and combine them flexibly 
to make an overall guess for the doc. This way we're not throwing out any data 
early; everything is available to the final guessing algorithm (simple though 
the provided one may be).
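
As a toy illustration of that shape (the clue weights, names, and resolution 
policy below are illustrative only, not what's in the patch):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CharsetGuesser {

  // One piece of evidence about the document's encoding.
  static final class Clue {
    final String charset;
    final int confidence;   // 0-100, on the same scale ICU uses
    final String source;    // "header", "meta", or "detector"
    Clue(String charset, int confidence, String source) {
      this.charset = charset;
      this.confidence = confidence;
      this.source = source;
    }
  }

  public static String guess(byte[] content, String httpHeaderCharset,
                             String metaTagCharset, String fallback) {
    // Gather every clue first; nothing is discarded early.
    List<Clue> clues = new ArrayList<Clue>();
    if (httpHeaderCharset != null) clues.add(new Clue(httpHeaderCharset, 90, "header"));
    if (metaTagCharset != null)    clues.add(new Clue(metaTagCharset, 80, "meta"));

    CharsetDetector detector = new CharsetDetector();
    detector.setText(content);
    for (CharsetMatch m : detector.detectAll()) {
      clues.add(new Clue(m.getName(), m.getConfidence(), "detector"));
    }

    // Simple resolution policy: highest-scoring clue the JVM actually supports.
    // A smarter policy can be dropped in later precisely because every clue is
    // still on the table at this point.
    Clue best = null;
    for (Clue c : clues) {
      if (isSupported(c.charset) && (best == null || c.confidence > best.confidence)) {
        best = c;
      }
    }
    return (best != null) ? best.charset : fallback;
  }

  private static boolean isSupported(String name) {
    try {
      return Charset.isSupported(name);
    } catch (RuntimeException e) {  // illegal charset names, etc.
      return false;
    }
  }
}

The resolution step at the end is deliberately dumb; the value is that it sees 
every clue, so it can be replaced without touching the plumbing.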

Problem 1) I don't think there's an easy solution to this. But fixing problem 
(2) seemed to improve performance on problem (1), presumably because the 
detector is getting cleaner input.

The small test shows significant improvement with the changes. I'm running a 
full test now.

I'm not sure what the best way to provide this is. I'm attaching a patch for 
TextParser and HtmlParser to use EncodingDetector, though you will likely have 
to apply these by hand, since my local tree is (roughly) 0.8.1 plus a ton of 
local changes. I'll also attach EncodingDetector as a separate file. If this 
doesn't work, or there is an easier way, please let me know; I'm relatively new 
to contributing stuff back, so I may need some coaching. (An easy-ish way would 
also be good, since I have lots of other local mods that are probably generally 
useful, and I can start contributing those back as I have time.)


> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-25.patch, NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in HTML documents (and in the case
> of XML, we have to use similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it'll be
> possible to achieve a high detection rate. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
