[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515230
]
Doğacan Güney commented on NUTCH-25:
------------------------------------
Overall I think the idea behind EncodingDetector is very solid. I will take a
better look at your patch, but here are a couple of comments after a quick
review:
* The EncodingDetector API is too open. IMO, EncodingClue should be a private
static class (users can pass a clue like detector.addClue(value, source,
confidence)), and EncodingDetector should never expose clues: it should store
them internally, and autoDetectClues, for example, should return void (or
perhaps a boolean indicating whether autodetection succeeded).
* code:
    public boolean meetsThreshold() {
      Integer mt = (Integer) thresholds.get(value);
      // use global value if no encoding-specific value found
      int myThreshold = (mt != null) ? mt.intValue() : minConfidence;
      return (confidence < 0 || (minConfidence >= 0 &&
          confidence >= myThreshold));
    }
  Why does meetsThreshold return true if confidence < 0?
* If users specify an encoding clue with no confidence, then we should give it
a default positive confidence instead of -1. Of course, that default needs to
be very small, maybe just +1.
* It would be nice to "stack" clues. Assume that autodetection returned 2
possible encodings: ISO-8859-1 with confidence 50 and UTF-8 with confidence 45.
If I add a new clue (say, coming from an HTTP header) for UTF-8 with +6
confidence, overall confidence for UTF-8 should now be 51.
* This is mostly my personal nit, but Java 5 style generics would be nice.
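
To make the points above concrete, here is a rough sketch of the shape I have
in mind. It is not taken from the patch: the class name EncodingDetectorSketch,
the methods getConfidence and guessEncoding, and the DEFAULT_CONFIDENCE
constant are all made up for illustration, and I am assuming that confidences
for the same encoding simply add up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch only; names and behavior are assumptions, not the patch.
class EncodingDetectorSketch {
  // Assumed default for clues supplied without a confidence (small but positive).
  private static final int DEFAULT_CONFIDENCE = 1;

  // Clues are an implementation detail: callers never see this type.
  private static final class EncodingClue {
    final String source;
    final int confidence;

    EncodingClue(String source, int confidence) {
      this.source = source;
      this.confidence = confidence;
    }
  }

  // Clues grouped by upper-cased encoding name, stored internally.
  private final Map<String, List<EncodingClue>> clues = new HashMap<>();

  // Clue without a confidence: falls back to the small positive default.
  void addClue(String encoding, String source) {
    addClue(encoding, source, DEFAULT_CONFIDENCE);
  }

  void addClue(String encoding, String source, int confidence) {
    String key = encoding.toUpperCase(Locale.ROOT);
    clues.computeIfAbsent(key, k -> new ArrayList<>())
        .add(new EncodingClue(source, confidence));
  }

  // "Stacked" confidence: the sum over all clues for this encoding (0 if none).
  int getConfidence(String encoding) {
    List<EncodingClue> list = clues.get(encoding.toUpperCase(Locale.ROOT));
    int total = 0;
    if (list != null) {
      for (EncodingClue clue : list) {
        total += clue.confidence;
      }
    }
    return total;
  }

  // Returns the encoding with the highest stacked confidence that reaches
  // minConfidence, or fallback if no encoding does.
  String guessEncoding(int minConfidence, String fallback) {
    String best = fallback;
    int bestTotal = minConfidence - 1;
    for (String encoding : clues.keySet()) {
      int total = getConfidence(encoding);
      if (total > bestTotal) {
        best = encoding;
        bestTotal = total;
      }
    }
    return best;
  }
}
```

With the numbers from the stacking example (ISO-8859-1 at 50, UTF-8 at 45 from
autodetection, then UTF-8 at +6 from the header), getConfidence("UTF-8") gives
51 and guessEncoding picks UTF-8 over ISO-8859-1. Summing is only one way to
combine clues; the real detector may well want something smarter.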
About contributing stuff back: The article at
http://wiki.apache.org/nutch/HowToContribute is a good starting point, but it
assumes that you will be working on trunk. I am not sure how you can
'forward-port' your changes from an older version besides doing it manually.
One approach may be to first backport part of trunk to your local
installation, change the code, and then do a "diff -pu" against the backported
version. Since trunk contains newer features and bug fixes, you will also be
getting them for free this way :).
> needs 'character encoding' detector
> -----------------------------------
>
> Key: NUTCH-25
> URL: https://issues.apache.org/jira/browse/NUTCH-25
> Project: Nutch
> Issue Type: New Feature
> Reporter: Stefan Groschupf
> Assignee: Doğacan Güney
> Fix For: 1.0.0
>
> Attachments: EncodingDetector.java, NUTCH-25.patch,
> NUTCH-25_draft.patch, patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but a different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it'll be
> possible to achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).