As Aaron noted, this is often a problem with the search results page
using an incorrect encoding.
I've also seen cases where the pages in question were not tagged
appropriately - e.g. there's a meta tag in the HTML that specifies the
wrong encoding.
I think browsers like Firefox trust this info less than Nutch does, so
they do more "sniffing" to determine whether the specified encoding is
wrong, and ignore it if so.
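
To make the "sniffing" idea concrete, here is a minimal sketch in Python. It is not Nutch's or Firefox's actual logic; the function names, the 4096-byte scan window, and the windows-1252 fallback are all assumptions chosen for illustration. The heuristic: if the raw bytes are valid UTF-8, use UTF-8 regardless of the meta tag; otherwise use the declared charset if the bytes decode under it; otherwise fall back to a default.

```python
import re

# Hypothetical helper: finds a charset name inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_charset(html_bytes):
    """Return the charset named in a meta tag, or None if there is none."""
    match = META_CHARSET.search(html_bytes[:4096])  # scan window is an assumption
    return match.group(1).decode("ascii") if match else None


def sniff_charset(html_bytes, default="windows-1252"):
    """Pick a charset, treating the declared one as a hint rather than a fact."""
    # Valid multi-byte UTF-8 rarely occurs by accident, so it outranks the tag.
    try:
        html_bytes.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    declared = declared_charset(html_bytes)
    if declared:
        try:
            html_bytes.decode(declared)
            return declared
        except (LookupError, UnicodeDecodeError):
            pass  # unknown charset name, or the bytes do not fit it

    return default  # assumed last-resort fallback


# A page that claims ISO-8859-1 but actually contains UTF-8 bytes.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1"></head>'
    b"<body>caf\xc3\xa9</body></html>"
)
text = page.decode(sniff_charset(page), errors="replace")
```

Decoding with errors="replace" at the end keeps a wrong guess from raising an exception, at the cost of a few replacement characters.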
-- Ken
On Oct 29, 2009, at 4:05pm, Fadzi Ushewokunze wrote:
Hi there,
I am having issues with the HTMLParser failing to detect the char
encoding, so lots of non-alphanumeric chars end up as "?".
I don't have any requirement for special characters; I am
happy with the usual UTF-8.
Any suggestions on the best way to configure this correctly? Everything
seems OK looking at the code, so I'm not sure what's missing.
Thanks.
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378