Spellchecking in the Chinese Language
Hi,

I have been trying to get spellcheck to work in the Chinese language, so far without any luck. Can someone shed some light, as a general guideline, on what needs to happen? I am using the CJKAnalyzer in the text field type and searching works fine, but spellchecking does not work.

Here are the things I have tried:

1. Put CJKAnalyzer in the textSpell field type.
2. Set the characterEncoding param to utf-8 in the spellcheck search component.
3. Using Luke, I can see the Chinese characters in the spell field in the main index.
4. After building the spelling index, I don't see Chinese characters in the spellchecker index, only terms in English.
5. Tried adding the NGramFilterFactory to the CJKAnalyzer with no luck either.

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Spellchecking-in-the-Chinese-Lanugage-tp2812726p2812726.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Spellchecking in the Chinese Language
Hi,

Does spellchecking in Chinese actually make sense? I once asked a native Chinese speaker about that and the person told me it didn't really make sense. Anyhow, with n-grams, I don't think this could technically work even if it made sense for Chinese, could it?

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message
From: alexw aw...@crossview.com
To: solr-user@lucene.apache.org
Sent: Tue, April 12, 2011 3:07:48 PM
Subject: Spellchecking in the Chinese Language
Re: Spellchecking in the Chinese Language
It doesn't make sense to spellcheck individual character-sized words, but it makes a lot of sense for phrases. Due to the pervasive use of pinyin input methods (IM), it's very easy to write phrases that sound correct but are semantically completely wrong. N-grams should work as long as the n-gram step doesn't mangle the characters.

On Tue, Apr 12, 2011 at 12:47 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
Hi, Does spellchecking in Chinese actually make sense? I once asked a native Chinese speaker about that and the person told me it didn't really make sense. Anyhow, with n-grams, I don't think this could technically work even if it made sense for Chinese, could it?
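
To illustrate the point about n-grams and mangled characters, here is a minimal Python sketch. It is not Solr or Lucene code, and the function names `char_ngrams` and `byte_ngrams` are hypothetical. It assumes that "mangling" means cutting n-grams over encoded bytes instead of over whole characters, which is only one possible way for the characters to get damaged.

```python
def char_ngrams(text: str, n: int) -> list:
    """Return overlapping n-grams cut over Unicode code points."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def byte_ngrams(text: str, n: int, encoding: str = "utf-8") -> list:
    """Return n-grams cut over the encoded bytes, decoded leniently.

    This is the failure mode: a window that splits a multi-byte
    character decodes to U+FFFD replacement characters.
    """
    raw = text.encode(encoding)
    return [raw[i:i + n].decode(encoding, errors="replace")
            for i in range(len(raw) - n + 1)]


phrase = "拼写检查"  # sample input only ("spell check")

# Character-level bigrams keep every Chinese character intact.
print(char_ngrams(phrase, 2))

# Byte-level windows of size 2 never hold a complete 3-byte character.
print(byte_ngrams(phrase, 2)[:3])
```

Cut over code points, the n-grams of a Chinese phrase are as well-formed as those of an English word, so n-gramming as such is not tied to any particular script.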
Re: Spellchecking in the Chinese Language
Thanks Otis and Luke. Yes, it does make sense to spellcheck phrases in Chinese. It looks like the default Solr spellCheck component is already doing some kind of n-gramming: when examining the spellCheck index, I did see gram1, gram2, gram3, gram4... The problem is that no Chinese terms were indexed into the spellChecker index, only English terms.

Regards,
Alex

--
View this message in context: http://lucene.472066.n3.nabble.com/Spellchecking-in-the-Chinese-Lanugage-tp2812726p2813149.html
Sent from the Solr - User mailing list archive at Nabble.com.
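
For readers unfamiliar with the gram1, gram2, ... fields mentioned in this message, the following Python sketch shows the general idea of storing a dictionary word's character n-grams under one field per n-gram size. It is an illustration under stated assumptions, not the actual Lucene or Solr implementation: the function name `gram_fields` is hypothetical, the fixed size range of 1 to 4 is an assumption taken from the field names seen above, and a real spellchecker index may choose sizes per word and store additional fields.

```python
def gram_fields(word: str, min_n: int = 1, max_n: int = 4) -> dict:
    """Map field names 'gramN' to the word's character N-grams.

    Sizes longer than the word produce no grams and are skipped.
    """
    fields = {}
    for n in range(min_n, max_n + 1):
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        if grams:
            fields[f"gram{n}"] = grams
    return fields


# The same decomposition applies to an English and a Chinese term.
for term in ("spelling", "拼写检查"):
    print(term, gram_fields(term, 1, 2))
```

In this sketch a Chinese term yields gramN entries in exactly the same way as an English one.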