On Oct 30, 2009, at 12:26am, Fadzi Ushewokunze wrote:
interesting - i will try this and let you know because it was set to
windows encoding (why on earth!?)
Because this encoding is only used when neither the server nor the
HTML meta specifies an encoding.
And in those situations, the most common encoding being used is CP-1252.
-- Ken
On Thu, 2009-10-29 at 20:35 -0400, Fuad Efendi wrote:
i dont have any special requirement for any special characters, i am
happy with usual utf-8
any suggestion on the best way to configure this correctly; everything
seems quite ok looking at the code not sure whats missing.
Try setting UTF-8 in the configuration file:
parser.character.encoding.default = UTF-8
-----Original Message-----
From: Fuad Efendi [mailto:f...@efendi.ca]
Sent: October-29-09 8:19 PM
To: nutch-user@lucene.apache.org; fa...@butterflycluster.net
Subject: RE: char encoding
Is it "?" or "¿" (Inverted Question Mark)?
Because ¿ is the replacement for character codes that have no
representation in a specific encoding scheme; you may get it, for
instance, if the binary stream is UTF-8 encoded and Nutch treats it as
Windows-1252. All bytes that have no representation in windows-1252
will be shown as "¿".
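A minimal sketch of this kind of mismatch in plain Java (not Nutch code; the class and method names are made up for illustration). It shows two separate effects: UTF-8 bytes decoded with the wrong charset turn into garbled characters, and a character that the target charset cannot represent is replaced when the text is encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
final class EncodingMismatchDemo {

    // Encode text as UTF-8, then decode the bytes with a different charset.
    static String decodeUtf8BytesAs(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    // Encode and decode with the same charset; characters the charset
    // cannot represent are replaced by its default replacement byte.
    static String roundTripThrough(String text, Charset target) {
        return new String(text.getBytes(target), target);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // U+00E9 is two bytes in UTF-8 (0xC3 0xA9); windows-1252 reads
        // them as two separate characters, U+00C3 and U+00A9.
        System.out.println(decodeUtf8BytesAs("caf\u00e9", cp1252));
        // US-ASCII has no U+00E9, so it is replaced with '?'.
        System.out.println(roundTripThrough("caf\u00e9", StandardCharsets.US_ASCII));
    }
}
```

Which replacement character ends up on screen depends on the charsets involved and on how the result is displayed, so the output here may not match what a given crawl shows.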
Nutch makes a best effort; however, it can't dedicate CPU to this the
way browsers do.... I agree with Ken. Browsers may fully ignore
headers/meta and sniff and analyze the byte array to find the correct
encoding (for instance, when the byte stream is UTF-8 but HTTP/meta
says windows-1252). Nutch can't do that (it requires a lot of CPU).
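To make "sniffing the byte array" concrete, here is a minimal sketch of one simple form of it: checking whether the bytes decode as strict UTF-8. This is only an illustration in plain Java, not what browsers or Nutch actually do, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch.
final class Utf8Check {

    // Returns true only if every byte sequence is well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of silently
                // substituting a replacement character.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A check like this has to walk the whole byte array, and pure ASCII content passes it regardless of the declared encoding, so it is a clue rather than a definitive answer.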
Windows-1252 is the default scheme for the html-parser when Nutch can't
find a correct HTTP/META...
From HtmlParser API:
* We need to do something similar to what's done by mozilla
* (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
* See also http://www.w3.org/TR/REC-xml/#sec-guessing
private static String sniffCharacterEncoding(byte[] content) {...}
- it doesn't currently use HTTP Headers.
- it tries to find META tag in first 2000 bytes.
So, for instance, some weird sites (such as AJAX/Portals) may have a lot
of generated JavaScript before the META tag; 2000 bytes could be too
small.
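The window limit described above can be sketched roughly as follows. This is an illustration only, not the actual Nutch implementation: the class name, the regular expression, and the `window` parameter are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the real HtmlParser code.
final class MetaSniffSketch {

    // Assumed pattern: a <meta ...> tag containing charset=NAME.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
        Pattern.CASE_INSENSITIVE);

    // Looks for a META charset in the first `window` bytes only.
    // Returns the charset name, or null if none is found in that window.
    static String sniffMetaCharset(byte[] content, int window) {
        int length = Math.min(content.length, window);
        // The tag itself is ASCII, so decoding the head as US-ASCII is
        // enough to locate it.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

With a window of 2000, a page whose META tag comes after 3000 bytes of generated script yields null here, which is the situation where the default encoding would be used instead.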
Then, EncodingDetector is called:
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
- but it doesn't make sense...
public String guessEncoding(Content content, String defaultValue) {
/*
 * This algorithm could be replaced by something more sophisticated;
 * ideally we would gather a bunch of data on where various clues
 * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each
 * with the correct answer, and use machine learning/some statistical
 * method to generate a better heuristic.
 */
TODO list... As a workaround, please check whether, for this site, the
META tag can be found in the first 2000 bytes...
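As a toy illustration of the "clues plus default" idea, here is a sketch in plain Java. It is not Nutch's EncodingDetector: the class is hypothetical, the method names only echo the ones quoted above, and the fixed priority order (HTTP header, then sniffed META, then the default) is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Nutch's EncodingDetector.
final class ClueSketch {

    private static final class Clue {
        final String encoding;
        final String source;

        Clue(String encoding, String source) {
            this.encoding = encoding;
            this.source = source;
        }
    }

    // Assumed priority: earlier sources win over later ones.
    private static final String[] PRIORITY = {"header", "sniffed"};

    private final List<Clue> clues = new ArrayList<>();

    // Records a clue; null or empty values (nothing found) are ignored.
    void addClue(String encoding, String source) {
        if (encoding != null && !encoding.isEmpty()) {
            clues.add(new Clue(encoding, source));
        }
    }

    // Returns the highest-priority clue, or defaultValue when no source
    // supplied an encoding.
    String guessEncoding(String defaultValue) {
        for (String source : PRIORITY) {
            for (Clue clue : clues) {
                if (source.equals(clue.source)) {
                    return clue.encoding;
                }
            }
        }
        return defaultValue;
    }
}
```

In this sketch, a page where the META tag was not found and the server sent no charset falls through to `defaultValue`, which is where a setting such as parser.character.encoding.default would come into play.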
-Fuad
http://www.linkedin.com/in/liferay
-----Original Message-----
From: Fadzi Ushewokunze [mailto:fa...@butterflycluster.net]
Sent: October-29-09 7:05 PM
To: nutch-user@lucene.apache.org
Subject: char encoding
hi there,
i am having issues with the HTMLParser failing to detect the char
encoding. so lots of non alpha-numeric chars end up as "?" ;
i dont have any special requirement for any special characters, i am
happy with usual utf-8
any suggestion on the best way to configure this correctly; everything
seems quite ok looking at the code not sure whats missing.
thanks.
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378