You are very welcome to discuss these problems with me. I work extensively on Chinese text parsing and segmentation.

David Cai, CTO of v-search.51vip.biz
QQ: 81296154
MSN: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]

2008/10/16 [EMAIL PROTECTED] <[EMAIL PROTECTED]>

> Let me ask the question in a different way; hopefully you guys can shed
> some light.
>
> The two commands I used, "readseg -get" and "readseg -dump", gave three
> different texts:
> 1) The Chinese text in the "parsetext" section is all correct (via -get)
> 2) The Chinese text in the html is all messed up (via -get)
> 3) The Chinese text in the html is largely correct, but messed up,
>    especially near Roman punctuation and braces (via -dump)
>
> I guess I need to know whether this is a fetch problem or a readseg
> problem before plunging into the source (as a newcomer).
>
> Thanks in advance!
>
> ---------- Forwarded message ----------
> From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
> Date: 2008/10/14
> Subject: Fwd: Fetch/Dump problem: Some Chinese characters incorrect.
> To: [email protected]
>
> And it became weirder when I used "readseg -get".
>
> The Chinese text in the "parsetext" section is all correct, while the main
> html page is totally messed up. Both differ from what I got with
> "readseg -dump".
>
> Does anybody have a clue? It seems to be a SegmentReader problem, which
> for some reason uses a shaky encoding/conversion when pulling text from
> segments?
>
> By the way, all the Chinese characters are in three-byte UTF-8.
>
> ---------- Forwarded message ----------
> From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
> Date: 2008/10/13
> Subject: Fetch/Dump problem: Some Chinese characters incorrect.
> To: [email protected]
>
> I obtained some Chinese-language webpages via "nutch fetch", but some
> Chinese characters do not come out right after I dump the segment back
> to html pages. For instance:
> http://www.dianping.com/shop/501079/
> has this title portion:
> <head><title>
> 韶山冲(徐汇店)(图)_上海_大众点评网
> </title>
>
> However, I got this after dumping:
> <head><title>
> 韶山��1¤7(徐汇庄1¤7)(��1¤7)_上海_大众点评罄1¤7
> </title>
>
> The charset specified in the page is "UTF-8". I also included the
> following in "nutch-site.xml":
> <name>parser.character.encoding.default</name>
> <value>UTF-8</value>
>
> It makes no difference.
>
> What could be the problem?
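
To help tell a fetch problem from a dump problem, here is a minimal Python sketch. It is not taken from Nutch; the function names `is_valid_utf8` and `round_trip` are hypothetical, and GBK is only an assumed example of a wrong double-byte charset. The first function checks whether raw bytes are still clean UTF-8. The second simulates what may happen when UTF-8 bytes are decoded and re-encoded with a double-byte charset somewhere along the way.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def round_trip(text: str, wrong_charset: str = "gbk") -> str:
    """Simulate UTF-8 bytes passing through a wrong double-byte charset.

    Hypothetical illustration only; the charset is an assumption.
    """
    raw = text.encode("utf-8")
    # Wrong decode: bytes that do not form a valid pair become U+FFFD.
    as_wrong = raw.decode(wrong_charset, errors="replace")
    # Writing back: U+FFFD has no mapping in the charset and becomes "?".
    damaged = as_wrong.encode(wrong_charset, errors="replace")
    # Reading the damaged bytes as UTF-8 again shows the visible corruption.
    return damaged.decode("utf-8", errors="replace")


if __name__ == "__main__":
    title = "韶山冲(徐汇店)(图)_上海_大众点评网"
    print(is_valid_utf8(title.encode("utf-8")))
    print(round_trip(title))
```

If the bytes written by the dump fail `is_valid_utf8` while the fetched page is fine, that would point at the dump side rather than the fetch. In the title quoted above, the characters that came out wrong (冲, 店, 图, 网) each end a run with an odd number of UTF-8 bytes, while 上海 (six bytes) survived. That pattern would be consistent with a double-byte round trip of this kind, though this is only a guess.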
