As Aaron noted, this is often a problem with the search results page using an incorrect encoding.

I've also seen cases where the pages in question were not tagged appropriately - e.g. there's a meta tag in the HTML that specifies the wrong encoding.

I think browsers like Firefox trust this info less than Nutch does, so they do more "sniffing" to check whether the specified encoding is wrong, and ignore it if so.
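
To make the "sniffing" idea concrete, here is a minimal Python sketch. It is not what Firefox or Nutch actually do; the function names (declared_charset, sniff_and_decode), the 4096-byte scan window, and the windows-1252 fallback are all assumptions chosen for illustration. The idea is simply to distrust the meta tag whenever the raw bytes decode cleanly as UTF-8.

```python
import re

# Hypothetical helper pattern: finds a charset name inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def declared_charset(raw: bytes):
    """Return the charset named in the first matching meta tag, or None."""
    match = META_CHARSET.search(raw[:4096])  # assumed scan window
    return match.group(1).decode("ascii").lower() if match else None


def sniff_and_decode(raw: bytes, fallback: str = "windows-1252"):
    """Decode HTML bytes, ignoring the declared charset when the bytes disagree.

    Returns a (text, encoding_used) tuple.
    """
    # Non-ASCII text in other encodings rarely passes strict UTF-8 decoding
    # by accident, so a clean decode is strong evidence of real UTF-8.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Otherwise fall back to whatever the page claims, if that codec
    # exists and the bytes are valid in it.
    declared = declared_charset(raw)
    if declared:
        try:
            return raw.decode(declared), declared
        except (UnicodeDecodeError, LookupError):
            pass

    # Last resort (assumed default): decode leniently so nothing is lost silently.
    return raw.decode(fallback, errors="replace"), fallback
```

For example, a page whose bytes are UTF-8 but whose meta tag claims iso-8859-1 would come back decoded as UTF-8, which is the mis-tagged case described above.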

-- Ken

On Oct 29, 2009, at 4:05pm, Fadzi Ushewokunze wrote:

hi there,

i am having issues with the HTMLParser failing to detect the char
encoding, so lots of non-alphanumeric chars end up as "?".

i don't have any requirement for special characters; i am
happy with the usual utf-8.

any suggestions on the best way to configure this correctly? everything
seems quite ok looking at the code; not sure what's missing.

thanks.




--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
