Let me ask the question in a different way; hopefully you can shed some light on it.
The two commands I used, "readseg -get" and "readseg -dump", gave three different texts:

1) The Chinese text in the "parsetext" section is all correct (via -get).
2) The Chinese text in the html is all messed up (via -get).
3) The Chinese text in the html is largely correct, but messed up especially near Roman punctuation and braces (via -dump).

I guess I need to know whether this is a fetch problem or a readseg problem before plunging into the source (as a newcomer). Thanks in advance!

---------- Forwarded message ----------
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Date: 2008/10/14
Subject: Fwd: Fetch/Dump problem: Some Chinese characters incorrect.
To: [email protected]

And it became weirder when I used "readseg -get". The Chinese text in the "parsetext" section is all correct, while the main html page is totally messed up; both are different from what I got with "readseg -dump".

Does anybody have a clue? It seems to be a SegmentReader problem, which for some reason uses a shaky encoding/conversion when pulling text from segments. By the way, all the Chinese characters are in three-byte UTF-8.

---------- Forwarded message ----------
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Date: 2008/10/13
Subject: Fetch/Dump problem: Some Chinese characters incorrect.
To: [email protected]

I obtained some Chinese-language web pages via "nutch fetch", but some Chinese characters do not come out right after I dump the segment back to html pages.

For instance, http://www.dianping.com/shop/501079/ has this title portion:

<head><title> 韶山冲(徐汇店)(图)_上海_大众点评网 </title>

However, I got this after dumping:

<head><title> 韶山��1¤7(徐汇庄1¤7)(��1¤7)_上海_大众点评罄1¤7 </title>

The charset specified in the page is "UTF-8". I included the following in "nutch-site.xml":

<name>parser.character.encoding.default</name>
<value>UTF-8</value>

but it makes no difference. What could be the problem?
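One way to narrow down whether this is a fetch problem or a readseg problem might be to check whether the raw bytes stored for the page are still valid UTF-8. Below is a minimal Java sketch under that assumption. The class `Utf8Check` and the method `isValidUtf8` are made-up names, not part of Nutch, and how the raw bytes are pulled out of a segment is not shown. The sketch also illustrates how a round trip through the wrong charset (US-ASCII here, chosen only for illustration) destroys three-byte characters while leaving Roman punctuation untouched.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Nutch.
public class Utf8Check {

    /** Returns true if the bytes decode as strict UTF-8 with no malformed sequences. */
    public static boolean isValidUtf8(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // First three CJK characters of the page title, followed by "(".
        String title = "\u97f6\u5c71\u51b2(";
        byte[] fetched = title.getBytes(StandardCharsets.UTF_8);

        // Intact bytes decode cleanly.
        System.out.println(isValidUtf8(fetched)); // true

        // A sequence cut in the middle of a three-byte character is rejected.
        System.out.println(isValidUtf8(Arrays.copyOf(fetched, 2))); // false

        // Assumption for illustration: the bytes are decoded with the wrong
        // charset and written back out. The CJK bytes are lost, "(" survives.
        String wrong = new String(fetched, StandardCharsets.US_ASCII);
        byte[] damaged = wrong.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(fetched, damaged)); // false
        System.out.println(wrong.endsWith("(")); // true
    }
}
```

If the stored bytes pass the check, the fetch is probably fine and the damage more likely happens when the text is decoded or written out; if they fail, the problem is probably earlier in the pipeline.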
