I have the same problem with caching when crawling Vietnamese pages that use the UTF-8 charset. I have dug into the Nutch configuration but have no idea how to solve it.
By the way, does anyone know how to force the crawler not to cache pages (that is, not to put the cached data into the DB)?

Here is my search (http://203.162.71.66:8080/search.jsp?query=%22qu%E1%BA%A3n+l%C3%BD%22&hitsPerPage=10&lang=en) and its cached copy (http://203.162.71.66:8080/cached.jsp?idx=0&id=37).

What should I do? :(

Best regards

-----Original Message-----
From: xu xiong [mailto:[EMAIL PROTECTED]
Sent: 07 June 2007 9:22 AM
To: [EMAIL PROTECTED]
Subject: ParseData encoding problem

Hi,

I use Nutch 0.9 to crawl some Chinese web sites and search through the Nutch web portal, but I found that the cached HTML copy displays incorrectly.

Then I used "bin/nutch readseg -dump" to dump the segments: ParseText (UTF-8) displays correctly, but the Chinese characters in Content display incorrectly as '?'. The original HTML uses the gb2312 charset.

What is the possible cause, and how can I fix it?

Thanks in advance,
Xiong
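The symptom described above (ParseText fine, raw Content showing '?') is consistent with the raw bytes being decoded or printed with a charset other than the page's own. The thread does not confirm that this is what Nutch does, so the following is only a minimal, standalone Java sketch of the general effect. The class and helper names are hypothetical and it does not use any Nutch API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Hypothetical helper: decode raw page bytes with a named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Hypothetical helper: simulate writing text through an encoder that
    // cannot represent it. Unmappable characters are replaced with '?'.
    static String roundTripThroughAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file
        // encoding does not matter.
        String original = "\u4e2d\u6587";

        // The bytes as a gb2312-encoded page would deliver them.
        byte[] raw = original.getBytes(Charset.forName("GB2312"));

        // Decoding with the page's own charset recovers the text.
        System.out.println(decode(raw, "GB2312").equals(original));

        // Decoding with a different charset yields unrelated characters.
        System.out.println(decode(raw, "ISO-8859-1").equals(original));

        // Encoding to a charset that lacks these characters yields '?'.
        System.out.println(roundTripThroughAscii(original));
    }
}
```

If something similar happens when the segment dump or the cached page is produced, the stored bytes may still be intact and only the charset used for display would need to match the original page.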
