Re: Multi-language Tokenizers / Filters recommended?

2007-06-24 Thread Otis Gospodnetic
://lucene-consulting.com/ - Original Message From: Xuesong Luo [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Saturday, June 23, 2007 11:48:55 PM Subject: RE: Multi-language Tokenizers / Filters recommended? For Chinese search, you may also consider

RE: Multi-language Tokenizers / Filters recommended?

2007-06-23 Thread Xuesong Luo
2:25 PM To: solr-user@lucene.apache.org Subject: RE: Multi-language Tokenizers / Filters recommended? Hi Daniel, As you know, Chinese and Japanese do not use spaces or any other delimiters to break words. To overcome this problem, CJKTokenizer uses a method called bi-gram, where runs

RE: Multi-language Tokenizers / Filters recommended?

2007-06-22 Thread Teruhiko Kurosaka
Hi Daniel, As you know, Chinese and Japanese do not use spaces or any other delimiters to break words. To overcome this problem, CJKTokenizer uses a method called bi-gram, where runs of ideographic (Chinese) characters are made into tokens of two neighboring characters. So a run of five ideographic characters becomes four overlapping two-character tokens.
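
To make the bi-gram idea concrete, here is a minimal Java sketch. This is not the Lucene CJKTokenizer source, just an illustration of the overlapping-pair scheme it describes; the class and method names are made up for the example.

    import java.util.ArrayList;
    import java.util.List;

    public class BigramSketch {

        // Split a run of ideographic characters into overlapping
        // two-character tokens. A run of five characters yields four tokens.
        static List<String> bigrams(String ideographRun) {
            List<String> tokens = new ArrayList<String>();
            for (int i = 0; i + 1 < ideographRun.length(); i++) {
                tokens.add(ideographRun.substring(i, i + 2));
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Five ideographic characters used as a stand-in for a Chinese run.
            System.out.println(bigrams("一二三四五"));
            // prints [一二, 二三, 三四, 四五]
        }
    }

In Solr you would not write this yourself; the usual approach is to point a field type's analyzer in schema.xml at the CJK tokenizer shipped with your Solr/Lucene version, which applies this bi-gram treatment to CJK runs while handling Latin text normally.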