You are very welcome to discuss these problems with me. I am heavily
involved in Chinese text parsing and segmentation.

David Cai, CTO of
v-search.51vip.biz
qq: 81296154
msn: [EMAIL PROTECTED]
email: [EMAIL PROTECTED]


2008/10/16 [EMAIL PROTECTED] <[EMAIL PROTECTED]>

> Let me ask the question in a different way; hopefully you guys can shed
> some light.
>
> The two commands I used, "readseg -get" and "readseg -dump", gave three
> different texts:
> 1) The Chinese text in the "parsetext" section is all correct (via -get)
> 2) The Chinese text in the html is all messed up (via -get)
> 3) The Chinese text in the html is largely correct (via -dump), but messed
> up, especially near Roman punctuation and braces.
>
> I guess I need to know whether this is a fetch problem or a readseg problem
> before plunging into the source (as a newcomer).
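One way to separate the two cases, as a minimal sketch: save the output of each readseg run to a file and check whether its bytes are still valid UTF-8. The helper names `first_invalid_utf8` and `context` are illustrative assumptions, not part of Nutch. If the saved output already contains malformed byte sequences, the text was damaged when it was written out rather than in the viewer. Since the "parsetext" section comes out correct, the fetched bytes were probably fine to begin with, but that is an inference, not something verified here.

```python
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8, or None if valid."""
    try:
        data.decode("utf-8")  # strict decoding raises on any malformed sequence
    except UnicodeDecodeError as err:
        return err.start  # index of the first offending byte
    return None


def context(data: bytes, offset: int, width: int = 8) -> str:
    """Hex view of the bytes around an offset, for eyeballing the damage."""
    lo = max(0, offset - width)
    return " ".join(f"{b:02x}" for b in data[lo:offset + width])
```

For example, read a saved dump in binary mode (the file name is hypothetical), call `first_invalid_utf8` on its bytes, and pass any reported offset to `context` to see which bytes surround the first damaged character.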
>
> Thanks in advance!
>
> ---------- Forwarded message ----------
> From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
> Date: 2008/10/14
> Subject: Fwd: Fetch/Dump problem: Some Chinese characters incorrect.
> To: [email protected]
>
>
> And it became weirder when I used "readseg -get".
>
> The Chinese text in "parsetext" section is all correct, while the main html
> page is totally messed up, both different from what I got with "readseg
> -dump".
>
> Does anybody have a clue? It seems to be a SegmentReader problem, which for
> some reason uses a shaky encoding/conversion when pulling text from segments?
>
> By the way, all the Chinese characters are in three-byte UTF-8.
>
>
> ---------- Forwarded message ----------
> From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
> Date: 2008/10/13
> Subject: Fetch/Dump problem: Some Chinese characters incorrect.
> To: [email protected]
>
>
> I obtained some Chinese language webpages via "nutch fetch". But some
> Chinese characters do not come out right after I dumped the segment back to
> html pages. For instance:
> http://www.dianping.com/shop/501079/
> has title portion:
> <head><title>
> 韶山冲(徐汇店)(图)_上海_大众点评网
> </title>
>
> However, I got this after dumping:
> <head><title>
> 韶山��1¤7(徐汇庄1¤7)(��1¤7)_上海_大众点评罄1¤7
> </title>
>
>
> The charset specified in the page is "UTF-8". I also included the following
> in "nutch-site.xml":
> <name>parser.character.encoding.default</name>
>  <value>UTF-8</value>
>
> It makes no difference.
>
> What could be the problem?
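One hypothesis that fits the garbled title above: somewhere between the segment and the output file, the UTF-8 bytes are decoded with a GB18030/GBK charset (for example a platform default on a Chinese-locale machine) and then encoded again. This is a guess from the symptom, not something confirmed in the Nutch source. The sketch below reproduces the effect in Python; `roundtrip_through` is a hypothetical helper name. GB18030 reads the UTF-8 bytes of Chinese text in pairs, so a run with an odd number of bytes leaves one byte stranded in front of the next ASCII character. That byte becomes U+FFFD, whose GB18030 code is the four bytes 84 31 A4 37, which would explain both the damage next to parentheses and the recurring "1¤7" (0x31, 0xA4, 0x37).

```python
def roundtrip_through(text: str, wrong_charset: str = "gb18030") -> str:
    """Simulate UTF-8 text being decoded and re-encoded with the wrong charset."""
    raw = text.encode("utf-8")  # bytes as stored after fetching
    # Wrong decode: byte pairs become unrelated characters, stray bytes become U+FFFD.
    misread = raw.decode(wrong_charset, errors="replace")
    # Writing the misread text back out with the same wrong charset.
    rewritten = misread.encode(wrong_charset, errors="replace")
    # What a UTF-8 viewer shows for the rewritten bytes.
    return rewritten.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # An even-length byte run, then a run that strands a byte before ")".
    for sample in ("上海_", "店)"):
        print(repr(sample), "->", repr(roundtrip_through(sample)))
```

In this simulation "上海_" (six bytes of Chinese text, an even count) survives, while "店)" comes back as "庄1" followed by a replacement character and "7)", which matches the "徐汇庄1¤7)" fragment in the dump. If this is what happens, the place to look would be wherever the dump is written without an explicit UTF-8 charset.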
>
>