[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375
]
Doug Cook commented on NUTCH-25:
--------------------------------
Hi, Doğacan.
My sincere apologies for the slow response, especially given the alacrity with
which you whipped up that patch.
I had to back-port the patch to my 0.81 environment for testing, so I can't
100% guarantee that your patch works as-is on 0.9.
At any rate, in my environment, it seems to work pretty well, at least in my
limited testing, and I didn't see any obvious problems on code review. I was
using a 50% confidence threshold and most of the time the detection code
kicked in (with the correct answer). All of the documents I was having problems
with were fine.
There seemed to be a typo in the patch; there's a try statement missing here,
if I read correctly, but I just put in a try and took out the funky
isTraceEnabled(), and all was well:
-           true);
-       } catch (SAXException e) {}
+           LOG.isTraceEnabled());
+       parser.setProperty("http://cyberneko.org/html/properties/default-encoding",
+           defaultCharEncoding);
+       parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset",
+           true);
+       } catch (SAXException e) {
+         LOG.trace(e);
+       }
My only (minor) suggestion would be to change the LOG.trace statements in
HtmlParser to note how they determined the encoding, e.g.:
if (LOG.isTraceEnabled()) {
  LOG.trace(base + ": setting encoding to (DETECTED) " + encoding);
}
That way one can look at the logs and see how often each of the 3 methods
(detection, response header, sniffing) is used.
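To make the suggestion concrete, here is a minimal sketch of that kind of tagged trace line. The class, enum, and method names are illustrative only, not taken from HtmlParser:

```java
// Sketch of the logging suggestion: tag each trace line with the method
// that produced the encoding, so the logs can be grepped per method.
public class EncodingTrace {

    // The three ways HtmlParser can arrive at an encoding, per the comment above.
    enum Source { DETECTED, HEADER, SNIFFED }

    // Build the log line in one place so the format stays consistent.
    static String traceLine(String base, Source source, String encoding) {
        return base + ": setting encoding to (" + source + ") " + encoding;
    }

    public static void main(String[] args) {
        System.out.println(traceLine("http://example.com/", Source.DETECTED, "UTF-8"));
        System.out.println(traceLine("http://example.com/", Source.HEADER, "ISO-8859-1"));
    }
}
```

Counting occurrences of each tag then gives a rough measure of how often each method fires.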
Thanks again for the patch; it's good stuff, and useful.
> needs 'character encoding' detector
> -----------------------------------
>
> Key: NUTCH-25
> URL: https://issues.apache.org/jira/browse/NUTCH-25
> Project: Nutch
> Issue Type: Wish
> Reporter: Stefan Groschupf
> Priority: Trivial
> Attachments: NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in HTML documents (and in the case
> of XML, we have to use a similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it should be
> possible to achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
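The fallback problem the original report describes can be sketched with a toy heuristic: when neither the HTTP header nor a meta tag names a charset, attempt a strict UTF-8 decode and fall back to a configured default. This is not Mozilla's detector and not Nutch code; the class and method names are invented for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Toy charset sniffer for 'unlabelled' documents: a strict UTF-8 decode
// either succeeds (valid UTF-8) or throws (fall back to the default).
public class CharsetSniffer {

    // Returns "UTF-8" if the bytes decode as valid UTF-8, else the fallback.
    static String sniff(byte[] content, String fallback) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(content));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = {(byte) 0x68, (byte) 0xe9, (byte) 0x6c}; // "hél" in ISO-8859-1
        System.out.println(sniff(utf8, "windows-1252"));
        System.out.println(sniff(latin1, "windows-1252"));
    }
}
```

A real detector (like the one Mozilla uses) adds per-encoding statistical models and confidence scores on top of this kind of validity check.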
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers