Hi,

We work in the proxy business. For one of our customers, to meet a business need, we modify some field values before sending them to Solr. Below is the relevant portion of our Solr schema.xml for the fields in question.
  <fieldType name="text_general_no_norm" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="field1" type="text_general_no_norm" indexed="true" stored="true"/>
  <field name="field2" type="text_general_no_norm" indexed="true" stored="true"/>

In some special cases, the values of these fields (field1, field2, etc.) can be *CJK* text; that is, field1 and field2 may hold either plain *English* text or *CJK* text. With *StandardTokenizer*, indexing and querying work fine for plain *English* text, but not for *CJK* text. With the current configuration, indexing a CJK string breaks it into individual characters (and sometimes pairs of characters) and indexes those. For example, if we index field1:"맯뭕禪玸킆諘叜葸", StandardTokenizer breaks it into 맯 뭕 禪玸 킆諘 叜 葸 and indexes those tokens. Later, when we search the same field with the same text - q=field1:"맯뭕禪玸킆諘叜葸" - we get the expected record along with many irrelevant ones. Our assumption is that when *Lucene* builds multiple tokens for the query, it combines them with *OR*: any record containing even a single token (맯 OR 뭕 OR 禪玸 OR 킆諘 OR 叜 OR 葸) is returned as a result.
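To make our assumption concrete, here is a small standalone sketch (not Solr code, just an illustration of OR versus AND semantics over the tokens shown above; the two documents are hypothetical):

```python
# Illustration only: simulates OR vs AND matching over the tokens that
# StandardTokenizer produced for "맯뭕禪玸킆諘叜葸".
query_tokens = ["맯", "뭕", "禪玸", "킆諘", "叜", "葸"]

# Two hypothetical indexed documents: one relevant, one sharing a single token.
docs = {
    "doc1": ["맯", "뭕", "禪玸", "킆諘", "叜", "葸"],  # the record we want
    "doc2": ["叜", "完全", "別"],                     # shares only one token
}

def matches_or(doc_tokens, q_tokens):
    # OR semantics: any single shared token is enough to match.
    return any(t in doc_tokens for t in q_tokens)

def matches_and(doc_tokens, q_tokens):
    # AND semantics: every query token must be present.
    return all(t in doc_tokens for t in q_tokens)

or_hits = [d for d, toks in docs.items() if matches_or(toks, query_tokens)]
and_hits = [d for d, toks in docs.items() if matches_and(toks, query_tokens)]
print(or_hits)   # ['doc1', 'doc2']  -> doc2 is the kind of irrelevant hit we see
print(and_hits)  # ['doc1']
```

If this model is right, forcing AND-style semantics at query time (for example via q.op=AND or mm=100% with the (e)dismax parsers) should suppress the doc2-style hits, though we are unsure whether that is acceptable for the English-text case.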
We also tried *LetterTokenizer*, but in many cases it does not behave the same as *StandardTokenizer*. We also tried the *copyField* option, but that is not feasible either, because the application layer is not flexible enough to detect CJK input and rewrite the query at runtime. Please suggest an indexing or querying approach that would let us filter out the irrelevant results.

Thanks in advance,
Nitesh
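For reference, the copyField variant we experimented with looked roughly like this (a sketch only; the field name field1_cjk is illustrative, and the analysis chain is the stock CJK bigram setup from the Solr example schemas):

```xml
<!-- Sketch: a CJK-aware sibling field fed by copyField.
     solr.CJKWidthFilterFactory and solr.CJKBigramFilterFactory are stock
     Solr factories; the field and type names here are illustrative. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

<field name="field1_cjk" type="text_cjk" indexed="true" stored="false"/>
<copyField source="field1" dest="field1_cjk"/>
```

The sticking point, as mentioned, is that the client would still have to decide at query time whether to search field1 or field1_cjk, and our application layer cannot make that decision.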