Jun,

Welcome!

First, there are probably bugs in Nutch's handling of character sets, both in the HTML parser and in the search page output. I would start by looking there.

Also, we currently attempt to tokenize each CJK ideograph as a separate word. That could be buggy too! It would be better to index these as bigrams. Please look at the classes in net.nutch.analysis, which contain the tokenizer that's used for documents and queries. I don't think it would be too hard to get it to generate bigrams for adjacent CJK ideograph pairs. Does that make sense?
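To make the bigram idea concrete, here is a rough, standalone sketch. It is not the tokenizer in net.nutch.analysis and does not use any Nutch or Lucene API; the class name CjkBigramSketch and its tokenize method are made up for illustration. It assumes that "CJK ideograph" can be approximated by Character.isIdeographic, which covers Han characters but not kana or Hangul, and that a lone ideograph with no neighbor should still be emitted as a single-character token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Nutch. */
public class CjkBigramSketch {

    /** Returns lowercase word tokens for non-CJK text and overlapping bigrams for runs of ideographs. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();   // current run of non-CJK letters and digits
        List<Integer> run = new ArrayList<>();      // current run of adjacent ideographs (code points)

        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            if (Character.isIdeographic(cp)) {
                flushWord(word, tokens);            // an ideograph ends any pending word
                run.add(cp);
            } else {
                flushRun(run, tokens);              // anything else ends the ideograph run
                if (Character.isLetterOrDigit(cp)) {
                    word.appendCodePoint(Character.toLowerCase(cp));
                } else {
                    flushWord(word, tokens);        // whitespace or punctuation ends the word
                }
            }
        }
        flushRun(run, tokens);
        flushWord(word, tokens);
        return tokens;
    }

    /** Emits one bigram per adjacent pair in the run, or the single ideograph if the run has length one. */
    private static void flushRun(List<Integer> run, List<String> tokens) {
        if (run.size() == 1) {
            tokens.add(new String(Character.toChars(run.get(0))));
        }
        for (int j = 0; j + 1 < run.size(); j++) {
            tokens.add(new StringBuilder()
                    .appendCodePoint(run.get(j))
                    .appendCodePoint(run.get(j + 1))
                    .toString());
        }
        run.clear();
    }

    /** Emits the pending non-CJK word, if any. */
    private static void flushWord(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Nutch 中文搜索"));
    }
}
```

Overlapping bigrams mean a query goes through the same tokenizer and matches without needing to know where the word boundaries are, which is why this works as a stopgap until a dictionary-based segmenter is available.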

Longer term we should consider automatic language detection and use of a dictionary-based segmenter for Chinese. Does that sound like a reasonable plan?

Cheers,

Doug

Jun Cai wrote:
Hello nutch-developers,

I would like to participate in Nutch development. My
contribution could be Chinese language
internationalization. My future interests are
focused crawling, by recursively training on the crawled
documents, and user query refinement.

Brief introduction of myself:

I just finished my master's studies at the University of Saarland,
Germany, supervised by Prof. Gerhard Weikum. My thesis is
about bootstrapping an ontology from Web documents.


Best regards, Jun Cai
[EMAIL PROTECTED]
2004-06-03










-------------------------------------------------------
This SF.Net email is sponsored by the new InstallShield X.
From Windows to Linux, servers to mobile, InstallShield X is the one
installation-authoring solution that does it all. Learn more and
evaluate today! http://www.installshield.com/Dev2Dev/0504
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers

