Re: EdgeNGramFilterFactory for Chinese characters

2015-10-26 Thread Zheng Lin Edwin Yeo
Hi Tomoko,
Thank you for your advice. I will look into the Java source code of the token filters.
Regards,
Edwin
On 26 October 2015 at 13:16, Tomoko Uchida wrote:
> > Will try to see if there is any way to manage it with only a single field?
>
> Of course you can try to create custom Tokenizer or

Re: EdgeNGramFilterFactory for Chinese characters

2015-10-25 Thread Tomoko Uchida
> Will try to see if there is any way to manage it with only a single field?
Of course, you can try to create a custom Tokenizer or TokenFilter that meets your needs exactly. I would copy the source code of EdgeNGramTokenFilter and modify its incrementToken() method. That seems a reasonable approach to me. incr
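The single-field idea above (a filter that expands only English tokens) can be sketched in plain Python. This is an illustration of the logic only, not Lucene code: `selective_edge_ngram` is a hypothetical name, the token list is assumed to come from an upstream tokenizer, and treating "ASCII letters only" as "English" is a simplifying assumption.

```python
def selective_edge_ngram(tokens, min_gram=1, max_gram=4):
    """Yield edge n-grams for ASCII-alphabetic tokens; pass other tokens through.

    Hypothetical stand-in for a customised token filter: it mimics the
    decision a modified incrementToken() would make for each token.
    """
    for token in tokens:
        if token.isascii() and token.isalpha():
            # English-looking token: emit its leading substrings.
            upper = min(max_gram, len(token))
            for n in range(min_gram, upper + 1):
                yield token[:n]
        else:
            # Chinese (or any non-ASCII) token: keep it whole.
            yield token


print(list(selective_edge_ngram(["solr", "搜索"], min_gram=2, max_gram=3)))
```

In this sketch the Chinese token reaches the index unchanged, while the English token is expanded for prefix matching.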

Re: EdgeNGramFilterFactory for Chinese characters

2015-10-25 Thread Zheng Lin Edwin Yeo
Hi Tomoko,
Thank you for your recommendation. I wasn't in favour of using copyField at first to have 2 separate fields for English and Chinese tokens, as it not only increases the index size but also slows down both indexing and querying. Will try to see if there is any way to

Re: EdgeNGramFilterFactory for Chinese characters

2015-10-25 Thread Tomoko Uchida
Hi, Edwin,
> This means it is better to have 2 separate fields for English and Chinese words?
Yes. I mean,
1. Define FIELD_1 that uses HMMChineseTokenizerFactory to extract English and Chinese tokens.
2. Define FIELD_2 that uses PatternTokenizerFactory to extract English tokens and EdgeNGramFilter
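As a rough illustration of what the second field in the list above would hold, the Python sketch below extracts English words with a regular expression and expands only those into prefixes. It is not Solr configuration: the pattern `[A-Za-z]+`, the lowercasing step, and the function name `english_prefix_terms` are all assumptions made for the example.

```python
import re

# Assumed definition of an "English token": a run of ASCII letters.
ENGLISH_WORD = re.compile(r"[A-Za-z]+")


def english_prefix_terms(text, min_gram=1, max_gram=20):
    """Return edge n-grams of the English words in text, ignoring everything else."""
    terms = []
    for word in ENGLISH_WORD.findall(text):
        word = word.lower()
        upper = min(max_gram, len(word))
        terms.extend(word[:n] for n in range(min_gram, upper + 1))
    return terms


# Chinese characters never match the pattern, so they produce no prefixes here.
print(english_prefix_terms("Solr 搜索 test", min_gram=2, max_gram=3))
```

The Chinese text would be searched through the first field instead, which is why this approach needs two fields.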

Re: EdgeNGramFilterFactory for Chinese characters

2015-10-25 Thread Zheng Lin Edwin Yeo
Hi Tomoko,
Thank you for your reply.
> If you need to perform partial (prefix) match for **only English words**,
> you can create a separate field that keeps only English words (I've never
> tried that, but might be possible by PatternTokenizerFactory or other
> tokenizer/filter chains...,) and a

Re: EdgeNGramFilterFactory for Chinese characters

2015-10-24 Thread Tomoko Uchida
> I have rich-text documents that are in both English and Chinese, and
> currently I have EdgeNGramFilterFactory enabled during indexing, as I need
> it for partial matching for English words. But this means it will also
> break up each of the Chinese characters into different tokens.
EdgeNGramFil

EdgeNGramFilterFactory for Chinese characters

2015-10-22 Thread Zheng Lin Edwin Yeo
Hi,
I would like to check: is it a good idea to use EdgeNGramFilterFactory for indexes that contain Chinese characters? Will it affect the accuracy of the search for Chinese words? I have rich-text documents that are in both English and Chinese, and currently I have EdgeNGramFilterFactory enabled during
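For readers unfamiliar with edge n-grams, here is a minimal pure-Python sketch of the idea. It is not the Lucene implementation; the function name `edge_ngrams` and the default gram sizes are assumptions chosen for illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=4):
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


print(edge_ngrams("search"))  # ['s', 'se', 'sea', 'sear']
print(edge_ngrams("中文"))    # ['中', '中文']
```

The same expansion applies to every token regardless of script, which is why a Chinese word ends up indexed under its leading characters as well as under the whole word.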