On Oct 30, 2009, at 12:26am, Fadzi Ushewokunze wrote:

Interesting - I will try this and let you know, because it was set to Windows encoding (why on earth!?)

Because this encoding is only used when neither the server nor the HTML meta specifies an encoding.

And in those situations, the most common encoding being used is CP-1252.

-- Ken


On Thu, 2009-10-29 at 20:35 -0400, Fuad Efendi wrote:
I don't have any special requirement for special characters; I am happy with the usual UTF-8.

Any suggestions on the best way to configure this correctly? Everything seems quite OK looking at the code; not sure what's missing.


Try setting UTF-8 in the configuration file:
parser.character.encoding.default = UTF-8
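If you are overriding it in conf/nutch-site.xml instead (the usual place to override nutch-default.xml, which ships this property as windows-1252), the equivalent entry would look something like:

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
</property>

Keep in mind this is only a fallback - it kicks in when no encoding can be found in the HTTP headers or the META tag.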



-----Original Message-----
From: Fuad Efendi [mailto:f...@efendi.ca]
Sent: October-29-09 8:19 PM
To: nutch-user@lucene.apache.org; fa...@butterflycluster.net
Subject: RE: char encoding

Is it "?" or "¿" (Inverted Question Mark)?

Because ¿ is replacement for character codes not having representation in specific encoding scheme; you may get it, for instance, if binary stream
is
UTF-8 encoded, and Nutch considers it as Windows-1252. All byte(s) not
having representation in windows-1252 will be represented as "¿".
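You can reproduce the effect outside of Nutch; here is a small standalone sketch (plain JDK, the sample string is just an example):

import java.nio.charset.Charset;

public class WrongCharsetDemo {
  public static void main(String[] args) {
    String original = "café – naïve";   // some non-ASCII characters
    byte[] utf8 = original.getBytes(Charset.forName("UTF-8"));

    // UTF-8 bytes decoded as windows-1252: almost every byte maps to *some*
    // character, so the usual symptom is mojibake such as "cafÃ©".
    System.out.println(new String(utf8, Charset.forName("windows-1252")));

    // The opposite direction: windows-1252 bytes decoded as UTF-8 hit invalid
    // byte sequences, and the decoder substitutes U+FFFD for each of them.
    byte[] cp1252 = original.getBytes(Charset.forName("windows-1252"));
    System.out.println(new String(cp1252, Charset.forName("UTF-8")));
  }
}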

Nutch does a best-effort job; however, it can't spend as much CPU per page as a browser does. I agree with Ken. Browsers may completely ignore the headers/META and sniff and analyze the byte array to find the correct encoding (for instance, when the byte stream is UTF-8 but the HTTP/META says windows-1252). Nutch can't afford to do that for every page (it requires a lot of CPU).
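That said, the sniffing itself is available as a library - IIRC Nutch's EncodingDetector wraps ICU's CharsetDetector for exactly this kind of guess. A minimal standalone sketch (assumes icu4j on the classpath; the sample bytes are invented):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class SniffSketch {
  public static void main(String[] args) throws Exception {
    // Pretend these are the raw octets fetched for a page.
    byte[] content =
        "<html><body>café, naïve, déjà vu</body></html>".getBytes("UTF-8");

    CharsetDetector detector = new CharsetDetector();
    detector.setText(content);
    CharsetMatch best = detector.detect();   // statistical best guess
    System.out.println(best.getName() + " (confidence " + best.getConfidence() + ")");
  }
}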

Windows-1252 is the default scheme for the HTML parser in case Nutch can't find a correct HTTP/META declaration...


From HtmlParser API:
  * We need to do something similar to what's done by mozilla
  * (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
  * See also http://www.w3.org/TR/REC-xml/#sec-guessing


private static String sniffCharacterEncoding(byte[] content) {...}

- it doesn't currently use the HTTP headers.
- it tries to find a META tag in the first 2000 bytes (a simplified sketch of that kind of sniffing follows below).
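Roughly, the sniffing amounts to something like this (simplified sketch, not the actual HtmlParser code; the real implementation is more careful about matching the META tag itself):

import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaSniffSketch {
  private static final int CHUNK_SIZE = 2000;
  private static final Pattern CHARSET_PATTERN = Pattern.compile(
      "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

  /** Returns the charset declared in the first 2000 bytes, or null. */
  static String sniffCharacterEncoding(byte[] content) {
    int length = Math.min(content.length, CHUNK_SIZE);
    // Latin-1 never fails to decode, and we only care about ASCII markup here.
    String head = new String(content, 0, length, Charset.forName("ISO-8859-1"));
    Matcher m = CHARSET_PATTERN.matcher(head);
    return m.find() ? m.group(1) : null;
  }
}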


So, for instance, some weird sites (such as AJAX-heavy portals) may have a lot of generated JavaScript before the META tag; 2000 bytes could be too small.

Then, the EncodingDetector is called:
     detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");

- but it doesn't make much sense...


 public String guessEncoding(Content content, String defaultValue) {
   /*
    * This algorithm could be replaced by something more sophisticated;
    * ideally we would gather a bunch of data on where various clues
    * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
    * the correct answer, and use machine learning/some statistical method
    * to generate a better heuristic.
    */



It's on the TODO list... As a workaround, please check whether for this site the META tag can be found within the first 2000 bytes (a quick check is sketched below)...
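One way to run that check (standalone sketch; replace the placeholder URL with the page that loses its encoding):

import java.io.InputStream;
import java.net.URL;

public class CheckMetaPosition {
  public static void main(String[] args) throws Exception {
    InputStream in = new URL("http://example.com/").openStream();  // placeholder URL
    byte[] head = new byte[2000];
    int read = 0, n;
    while (read < head.length && (n = in.read(head, read, head.length - read)) != -1) {
      read += n;
    }
    in.close();

    String prefix = new String(head, 0, read, "ISO-8859-1");
    System.out.println("charset declared in first 2000 bytes: "
        + prefix.toLowerCase().contains("charset="));
  }
}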



-Fuad
http://www.linkedin.com/in/liferay


-----Original Message-----
From: Fadzi Ushewokunze [mailto:fa...@butterflycluster.net]
Sent: October-29-09 7:05 PM
To: nutch-user@lucene.apache.org
Subject: char encoding

hi there,

I am having issues with the HtmlParser failing to detect the character encoding, so lots of non-alphanumeric characters end up as "?".

I don't have any special requirement for special characters; I am happy with the usual UTF-8.

Any suggestions on the best way to configure this correctly? Everything seems quite OK looking at the code; not sure what's missing.

thanks.


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
