I have the same problem with cached pages when crawling Vietnamese pages that
use the utf-8 charset. I have dug into the Nutch configuration but have no idea
how to solve it.

By the way, does anyone know how to force the crawler not to cache pages
(that is, not to put the cached data into the DB)?

Here is my search
(http://203.162.71.66:8080/search.jsp?query=%22qu%E1%BA%A3n+l%C3%BD%22&hitsPerPage=10&lang=en)
and its cached copy (http://203.162.71.66:8080/cached.jsp?idx=0&id=37)

What should I do? :(

Best regards

-----Original Message-----
From: xu xiong [mailto:[EMAIL PROTECTED] 
Sent: 07 June 2007 9:22 AM
To: [EMAIL PROTECTED]
Subject: ParseData encoding problem

Hi,

I use Nutch 0.9 to crawl some Chinese web sites and search them through the
Nutch web portal, but I found that the cached HTML copy displays incorrectly.
I then used "bin/nutch readseg -dump" to dump the segments:
ParseText (UTF-8) displays correctly, but the Chinese characters in
Content display incorrectly as '?'. The original HTML uses the gb2312
charset.
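
To make the '?' symptom concrete, here is a minimal sketch in plain Java (not
Nutch code) of one way such output can arise: text is re-encoded into a charset
that cannot represent it. The class and method names are hypothetical, and
ISO-8859-1 is only an assumed example of a lossy target charset; whether this is
what happens inside readseg is not established here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; nothing here comes from the Nutch code base.
class EncodingLossDemo {

    // Encode the text into the given charset and decode it back again.
    // String.getBytes(Charset) replaces every unmappable character with
    // the charset's replacement byte, which is '?' for ISO-8859-1.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters

        // UTF-8 can represent the characters, so the round trip is lossless.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original));

        // ISO-8859-1 cannot, so each character comes back as '?'.
        System.out.println(roundTrip(original, StandardCharsets.ISO_8859_1));
    }
}
```

Once a character has been replaced by '?', the original bytes are gone, so a
fix would have to happen where the bytes are first converted to text.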

What is the possible cause, and how can I fix it?

Thanks in advance,
Xiong


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
