Hi,

We work in the proxy business for our customer: to meet certain business
needs, we modify some field values and send them to Solr. Below is the
configuration from our Solr schema.xml for the relevant field type and
fields.

    <fieldType name="text_general_no_norm" class="solr.TextField"
               positionIncrementGap="100" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


   <field name="field1" type="*text_general_no_norm*" indexed="true"
stored="true"/>
   <field name=" field2" type="*text_general_no_norm*" indexed="true"
stored="true"/>

As mentioned above, in some special cases the values of these fields
(field1, field2, etc.) can be CJK text. That is, field1 and field2 may
store either plain English text or CJK text. With the StandardTokenizer,
indexing and querying work fine for plain English text, but for CJK text
they do not behave appropriately.

When we index CJK text with the current configuration, the tokenizer breaks
it into single characters (and sometimes pairs of characters) and indexes
those tokens. As an example, if we index the text

field1: "맯뭕禪玸킆諘叜葸"

the StandardTokenizer breaks it into the tokens

    맯   뭕   禪玸   킆諘   叜   葸

and indexes them.
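
For reference, this token stream can be reproduced with a small standalone
Lucene program. This is only a sketch; it assumes a recent Lucene version
(5.x or later, where StandardTokenizer has a no-arg constructor) on the
classpath:

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenizeDemo {
        public static void main(String[] args) throws Exception {
            // Tokenize the CJK sample string and print each token on its own line.
            StandardTokenizer tokenizer = new StandardTokenizer();
            tokenizer.setReader(new StringReader("맯뭕禪玸킆諘叜葸"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());
            }
            tokenizer.end();
            tokenizer.close();
        }
    }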

Later, when we search on the same field with similar text,
q: field1: "맯뭕禪玸킆諘叜葸", it returns the matching result along with many
irrelevant results. Our assumption is that when Lucene breaks the query
into multiple tokens, it combines them with an OR operation. Hence, if any
single token from (맯 OR 뭕 OR 禪玸 OR 킆諘 OR 叜 OR 葸) is present in a
record, that record is returned as a result.
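
One way to check this assumption is to look at the parsed query that Solr
reports when debugQuery is enabled. Below is a minimal SolrJ sketch (the
URL and core name "mycore" are placeholders, not our real setup):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ParsedQueryDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and core name; adjust for the actual deployment.
            HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
            SolrQuery query = new SolrQuery("field1:맯뭕禪玸킆諘叜葸");
            query.set("debugQuery", "true"); // ask Solr to echo the parsed query
            QueryResponse response = client.query(query);
            // "parsedquery" in the debug section shows how the tokens were combined.
            System.out.println(response.getDebugMap().get("parsedquery"));
            client.close();
        }
    }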

We also tried the LetterTokenizer, but in many cases it does not behave the
same way as the StandardTokenizer. We also tried the copyField option, but
that is not feasible either, because the application layer is not flexible
enough to detect CJK tokens and change the query at runtime.
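
(For completeness: detecting CJK input in plain Java is possible with
Character.UnicodeScript, along the lines of the sketch below, but as noted
we cannot easily hook such a check into our proxy/application layer at
query time.)

    public class CjkDetect {
        // Returns true if the string contains any code point from a CJK script.
        static boolean containsCjk(String s) {
            return s.codePoints().anyMatch(cp -> {
                Character.UnicodeScript script = Character.UnicodeScript.of(cp);
                return script == Character.UnicodeScript.HAN
                    || script == Character.UnicodeScript.HANGUL
                    || script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA;
            });
        }

        public static void main(String[] args) {
            System.out.println(containsCjk("맯뭕禪玸킆諘叜葸"));   // true
            System.out.println(containsCjk("plain English text")); // false
        }
    }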

Please suggest an approach for indexing or querying that would let us
filter out the irrelevant results.

Thanks in advance.
Nitesh
