Hi, all

I'm trying to make Nutch support Chinese and ran into a funny issue: the crawler
printed out the following log information:

Indexing [http://sc.yfsz.com/cat.asp?catid=120] with analyzer [EMAIL PROTECTED] (it)
Indexing [http://sc.yfsz.com/cat.asp?catid=126] with analyzer [EMAIL PROTECTED] (nl)
Indexing [http://sc.yfsz.com/cat.asp?catid=130] with analyzer [EMAIL PROTECTED] (fi)
Indexing [http://sc.yfsz.com/cat.asp?catid=128] with analyzer [EMAIL PROTECTED] (en)
Indexing [http://sc.yfsz.com/help.asp?action=fukuan] with analyzer [EMAIL PROTECTED] (zh)
...

Actually, all of these web pages are in Chinese, so doc.get("lang") should
always return "zh" in Indexer.java, line 91. But as you can see, it sometimes
returns "it", "nl", "fi", "en" and so on, and then the built-in analyzer is
called instead of the Chinese one.
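
To make the fallback concrete, here is a minimal sketch of how I understand
the selection to behave. It is not Nutch's actual code: the class name
AnalyzerPicker, the analyzer names, and the registry map are all hypothetical
and only illustrate a per-language lookup with a default. The point is that
any misdetected language code ends up with the built-in analyzer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the analyzer lookup; NOT Nutch's real implementation.
final class AnalyzerPicker {
    // Assumed registry: detected language code -> analyzer name.
    private static final Map<String, String> BY_LANG = new HashMap<>();
    static {
        BY_LANG.put("zh", "chinese-analyzer"); // hypothetical analyzer name
    }

    // Hypothetical name for the analyzer used when no language matches.
    static final String DEFAULT = "built-in-analyzer";

    // Returns the analyzer registered for lang, or the default when
    // lang is null or has no registered analyzer.
    static String pick(String lang) {
        if (lang == null) {
            return DEFAULT;
        }
        return BY_LANG.getOrDefault(lang, DEFAULT);
    }
}
```

With this lookup, a page detected as "zh" gets the Chinese analyzer, while the
same Chinese page misdetected as "it" or "fi" silently gets the default one.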

Is this a bug in the language-identifier plugin or in NekoHTML?




songjue
2007-04-16
