Tom Burton-West created LUCENE-4286:
---------------------------------------

             Summary: Add flag to CJKBigramFilter to allow indexing unigrams as 
well as bigrams
                 Key: LUCENE-4286
                 URL: https://issues.apache.org/jira/browse/LUCENE-4286
             Project: Lucene - Core
          Issue Type: Improvement
    Affects Versions: 3.6.1, 4.0-ALPHA
            Reporter: Tom Burton-West
            Priority: Minor


Add an optional flag to the CJKBigramFilter to tell it to also output 
unigrams. This would allow indexing both bigrams and unigrams. At query time, 
an analyzer without the flag would still produce bigrams, except when the query 
is a single Han character, which is emitted as a unigram and would then match 
the indexed unigrams.

As an example, here is a configuration of a Solr fieldType whose index analyzer 
sets the "indexUnigrams" flag and whose query analyzer omits it:

<fieldType name="CJK" class="solr.TextField" autoGeneratePhraseQueries="false">
<analyzer type="index">
   <tokenizer class="solr.ICUTokenizerFactory"/>
   <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
</analyzer>

<analyzer type="query">
   <tokenizer class="solr.ICUTokenizerFactory"/>
   <filter class="solr.CJKBigramFilterFactory" han="true"/>
</analyzer>
</fieldType>
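
To illustrate the intended token output, here is a minimal sketch in plain Java. 
It does not use Lucene and is not the filter's implementation: the class name 
CjkBigramSketch and the method tokens() are hypothetical, and the order in 
which unigrams and bigrams are emitted is an assumption. For the three-character 
Han run U+4E2D U+56FD U+4EBA, the flag would add the three single characters to 
the two overlapping bigrams.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: simulates the proposed behaviour for one run of adjacent
// Han characters. Names and token ordering are assumptions, not Lucene's API.
final class CjkBigramSketch {

    /** Splits a string into one string per Unicode code point. */
    static List<String> codePoints(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    /**
     * Tokens for a run of adjacent Han characters.
     * indexUnigrams == false: overlapping bigrams, or the lone character when
     * the run has length 1 (the behaviour described in the issue).
     * indexUnigrams == true: every single character plus every overlapping bigram.
     */
    static List<String> tokens(String hanRun, boolean indexUnigrams) {
        List<String> chars = codePoints(hanRun);
        boolean isolated = chars.size() == 1;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < chars.size(); i++) {
            if (indexUnigrams || isolated) {
                out.add(chars.get(i));
            }
            if (i + 1 < chars.size()) {
                out.add(chars.get(i) + chars.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String run = "\u4E2D\u56FD\u4EBA";
        System.out.println(tokens(run, true));       // index side: 3 unigrams + 2 bigrams
        System.out.println(tokens(run, false));      // query side: 2 bigrams
        System.out.println(tokens("\u4EBA", false)); // single-character query: 1 unigram
    }
}
```

With the flag set only at index time, the single-character query token in the 
last call would find a matching unigram in the same field.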

Use case: About 10% of our queries that contain Han characters are single 
character queries. The CJKBigramFilter only outputs single characters when 
there are no adjacent bigrammable characters in the input. This means we have 
to create a separate field to index Han unigrams in order to address single 
character queries, and then write application code to search that separate 
field if we detect a single character Han query. This is rather kludgy. With 
the optional flag, we could configure Solr as above and would need neither the 
separate field nor the application code.
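
For context, the application-side check that the current workaround requires 
might look roughly like the following sketch. The class name HanQueryRouter and 
both field names are hypothetical, not taken from our actual setup; the Han 
test relies on the JDK's Character.UnicodeScript.

```java
// Illustration only: routes a query to a separate unigram field when it is a
// single Han character. Class and field names are hypothetical.
final class HanQueryRouter {

    static final String BIGRAM_FIELD = "text_cjk";          // hypothetical field name
    static final String UNIGRAM_FIELD = "text_han_unigram"; // hypothetical field name

    /** True if the trimmed query is exactly one code point in the Han script. */
    static boolean isSingleHanCharacter(String query) {
        String q = query.trim();
        if (q.isEmpty() || q.codePointCount(0, q.length()) != 1) {
            return false;
        }
        return Character.UnicodeScript.of(q.codePointAt(0)) == Character.UnicodeScript.HAN;
    }

    /** Picks the field to search for the given query string. */
    static String fieldFor(String query) {
        return isSingleHanCharacter(query) ? UNIGRAM_FIELD : BIGRAM_FIELD;
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("\u4EBA"));       // single Han character
        System.out.println(fieldFor("\u4E2D\u56FD")); // two Han characters
        System.out.println(fieldFor("a"));            // non-Han query
    }
}
```

The proposed flag would make this routing step unnecessary, since one field 
would hold both unigrams and bigrams.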

This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter 
used to allow single word queries (although that uses word n-grams rather than 
character n-grams).




