Let me ask the question in a different way; hopefully you can shed some
light.

The two commands I used, "readseg -get" and "readseg -dump", gave three
different versions of the text:
1) The Chinese text in the "parsetext" section is all correct (via -get).
2) The Chinese text in the html is all messed up (via -get).
3) The Chinese text in the html is largely correct, but messed up especially
near Roman punctuation and brackets (via -dump).

I guess I need to know whether this is a fetch problem or a readseg problem
before plunging into the source (as a newcomer).
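
One rough way to narrow this down is to look at the raw bytes of a dumped
file instead of how a viewer renders them. Below is a minimal Python sketch,
not part of Nutch; the helper names first_invalid_utf8 and
count_replacement_chars are made up for illustration. It reports the offset
of the first byte sequence that is not valid UTF-8, and counts U+FFFD
replacement characters that were already written into the text by some
earlier lossy conversion.

```python
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        # err.start is the index where the undecodable sequence begins.
        return err.start


def count_replacement_chars(data: bytes) -> int:
    """Count U+FFFD characters already present in otherwise decodable text."""
    # Invalid bytes are skipped so that only literal U+FFFD characters count.
    return data.decode("utf-8", errors="ignore").count("\ufffd")
```

To use it, read the dump in binary mode (for example
open(path, "rb").read(), where path is wherever "readseg -dump" wrote its
output) and pass the bytes to both functions. If the bytes are valid UTF-8
with no U+FFFD, the damage is more likely in the viewer; otherwise it
happened somewhere before or while the file was written.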

Thanks in advance!

---------- Forwarded message ----------
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Date: 2008/10/14
Subject: Fwd: Fetch/Dump problem: Some Chinese characters incorrect.
To: [email protected]


And it gets weirder when I use "readseg -get".

The Chinese text in the "parsetext" section is all correct, while the main
html page is totally messed up; both differ from what I got with "readseg
-dump".

Does anybody have a clue? It seems to be a SegmentReader problem, which for
some reason uses a shaky encoding/conversion when pulling text from segments?

By the way, all the Chinese characters are in three-byte UTF-8.


---------- Forwarded message ----------
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Date: 2008/10/13
Subject: Fetch/Dump problem: Some Chinese characters incorrect.
To: [email protected]


I obtained some Chinese language webpages via "nutch fetch". But some
Chinese characters do not come out right after I dumped the segment back to
html pages. For instance:
http://www.dianping.com/shop/501079/
has title portion:
<head><title>
韶山冲(徐汇店)(图)_上海_大众点评网
</title>

However, I got this after dumping:
<head><title>
韶山��1¤7(徐汇庄1¤7)(��1¤7)_上海_大众点评罄1¤7
</title>
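
For what it is worth, this kind of damage (most characters intact, with the
last character before an ASCII symbol broken) is consistent with three-byte
UTF-8 text being decoded under a two-byte charset and then written back.
Whole byte pairs survive the round trip, but an odd leftover byte next to an
ASCII character does not. The Python sketch below only illustrates that
mechanism. GBK is a guess at the intermediate charset, the function name
round_trip_through is made up, and nothing here shows where in Nutch (or in
the tools used to view the dump) such a conversion would actually happen.

```python
def round_trip_through(text: str, wrong_charset: str = "gbk") -> str:
    """Simulate UTF-8 bytes being misread as another charset and re-saved."""
    utf8_bytes = text.encode("utf-8")
    # Step 1: the UTF-8 bytes are wrongly decoded as the other charset.
    misread = utf8_bytes.decode(wrong_charset, errors="replace")
    # Step 2: the misread text is written back out in that same charset.
    rewritten = misread.encode(wrong_charset, errors="replace")
    # Step 3: the result is finally interpreted as UTF-8 again.
    return rewritten.decode("utf-8", errors="replace")


damaged = round_trip_through("韶山冲(徐汇店)")
# ascii() keeps the output printable regardless of the console encoding.
print(ascii(damaged))
```

Pure ASCII passes through this round trip unchanged, which would also
explain why only the Chinese text is affected.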


The charset specified in the page is "UTF-8". I also included the following
in "nutch-site.xml":
<property>
  <name>parser.character.encoding.default</name>
  <value>UTF-8</value>
</property>

It makes no difference.

What could be the problem?

