Hi,

I am using Nutch 0.9 to crawl some Chinese web sites. When I search through
the Nutch web portal, the cached HTML copy displays incorrectly.
I then used "bin/nutch readseg -dump" to dump the segments:
the ParseText (UTF-8) displays correctly, but the Chinese characters in
Content display incorrectly as '?'. The original HTML uses the gb2312
charset.
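
To show the kind of symptom I mean, here is a small Python sketch. It is not Nutch code, and the sample string and variable names are made up for illustration. It only shows my guess at how GB2312 bytes could turn into '?' or other garbage when the wrong charset is applied somewhere along the way.

```python
# Hypothetical sample text; any Chinese string would do.
text = "中文"

# Bytes as a GB2312 page would deliver them.
raw = text.encode("gb2312")

# Decoding with the correct charset round-trips cleanly.
as_gb = raw.decode("gb2312")

# Decoding the same bytes as UTF-8 fails; with errors="replace" the
# invalid sequences become U+FFFD replacement characters.
as_utf8 = raw.decode("utf-8", errors="replace")

# Writing the text through a charset that cannot represent it
# turns each character into a literal '?'.
as_ascii = text.encode("ascii", errors="replace").decode("ascii")

print(as_gb == text)
print("\ufffd" in as_utf8)
print(as_ascii)
```

If something like the last step happens when the raw Content is dumped or shown as the cached copy, that would match what I see, but this is only an assumption on my part.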

What is the likely cause, and how can I fix it?

Thanks in advance,
Xiong
