[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426
]
Doug Cook commented on NUTCH-25:
--------------------------------
Not sure where this belongs architecturally and aesthetically -- will think
about that.
The relevance test results look good -- overall at least as good as before.
The histogram of confidence values from ICU4J on a ~60K-doc test DB looks
something like:

  confidence   docs
       0-9        6
     10-19      440
     20-29     2466
     30-39     7724
     40-49    11372
     50-59    10791
     60-69     9583
     70-79     4519
     80-89     4479
     90-99      386
I did find a small number of cases where high-ish (>50%) confidence detection
was wrong:
http://viniform.typepad.fr/dn/2006/10/mise_jour_du_cl.html
http://www.buscamaniban.com/fr/patrimoine/coeur-armagnac.php
http://www.lafite.com/en/html/Corporate/1.html
http://www.franz-keller.de/8860.html
http://www.vinesnwines.org/?m=200605
In all these cases, ICU4J guessed Latin-1, while the page was (correctly)
reported(*) or sniffed(*) as UTF-8. That said, overall ICU4J seems to
perform quite well. In addition to the overall relevance tests, I searched
for the word fragment "teau," which occurs frequently when the word Château is
parsed with the wrong encoding (yielding Ch + garbage + teau). Prior to the
patch I saw 102 occurrences; afterwards, 69. Many of those 69 were on pages
with mixed encodings or typos, so the fragment shows up even in the browser.
Many of the remaining pages were text files or RSS feeds (parsed by
TextParser, which I haven't yet adapted to use the encoding detection; doing
that now).
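The "teau" artifact is easy to reproduce in plain Java (a standalone sketch, not Nutch code): encode Château as UTF-8, then decode the bytes with the wrong charset, as a parser would after a bad detection.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MojibakeDemo {
    // Decode UTF-8 bytes as Latin-1, as a parser would after a wrong
    // charset detection, and return the garbled string.
    static String garble(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = garble("Château");
        System.out.println(garbled); // prints "ChÃ¢teau"
        // A tokenizer splitting on non-letters now emits the orphaned
        // fragment "teau" as its own searchable term.
        System.out.println(Arrays.asList(garbled.split("[^\\p{L}]+")));
    }
}
```

The two-byte UTF-8 sequence for â (0xC3 0xA2) becomes two Latin-1 characters, one of which is not a letter, so the indexer's tokenizer splits the word and "teau" becomes an independent term.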
Architecturally I think we should store the detected encoding AND the
confidence in all cases (even when low), instead of storing it only when the
confidence meets some threshold. That way the decision of which value to use
can be made later, in the parser, which can make a "smart" decision based upon
all the data that's available (detected, sniffed, reported, plus confidence
value on detection). Then, for example, if there is no sniffed or reported
value, we could use the detected value, even if the confidence is low
(especially useful in the TextParser). We could also make decisions like "the
confidence is medium, but the same value is both sniffed and reported, so let's
trust that instead," which might fix some of the detection problem cases.
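The decision logic described above might be sketched like this (a hypothetical illustration -- the method name, null-means-absent convention, and the 90% threshold are all my assumptions, not Nutch code):

```java
import java.nio.charset.Charset;

public class EncodingResolver {
    // Pick an encoding from three clues: the detector's guess plus its
    // 0-100 confidence, the sniffed (metatag) value, and the reported
    // (HTTP header) value. Any clue may be null when absent.
    public static String resolve(String detected, int confidence,
                                 String sniffed, String reported) {
        // Very high confidence: trust the detector outright.
        if (detected != null && confidence >= 90) return detected;
        // Medium/low confidence, but sniffed and reported agree:
        // trust the two independent explicit sources instead.
        if (sniffed != null && sniffed.equalsIgnoreCase(reported)) return sniffed;
        // Otherwise prefer whatever explicit clue exists.
        if (sniffed != null) return sniffed;
        if (reported != null) return reported;
        // No sniffed or reported value: use the detected value even at
        // low confidence (the TextParser case).
        if (detected != null) return detected;
        return Charset.defaultCharset().name(); // last-ditch fallback
    }
}
```

Storing detected value and confidence unconditionally is what makes this possible: the parser sees all the evidence and can apply rules like these, instead of the fetcher discarding low-confidence guesses up front.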
Hope this all makes sense. I'll keep plugging away at this today and report
back on what I find. Thanks for all the help and quick responses.
Doug
(*) By "reported," I mean in the HTTP header, and by "sniffed," I mean
specified in the page metatags (since this is the term used in the code).
> needs 'character encoding' detector
> -----------------------------------
>
> Key: NUTCH-25
> URL: https://issues.apache.org/jira/browse/NUTCH-25
> Project: Nutch
> Issue Type: New Feature
> Reporter: Stefan Groschupf
> Assignee: Doğacan Güney
> Fix For: 1.0.0
>
> Attachments: NUTCH-25.patch, NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in the case
> of XML, we have to use a similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it should be
> possible to achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers