[ https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tom Burton-West updated LUCENE-4286:
------------------------------------

    Attachment: LUCENE-4286.patch_3.x

We are still using Solr 3.6 in production, so I backported the patch to Lucene/Solr 3.6. Attached as LUCENE-4286.patch_3.x.

> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-4286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4286
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.0-ALPHA, 3.6.1
>            Reporter: Tom Burton-West
>            Priority: Minor
>             Fix For: 4.0-BETA, 5.0
>
>         Attachments: LUCENE-4286.patch, LUCENE-4286.patch,
>                      LUCENE-4286.patch_3.x
>
>
> Add an optional flag to the CJKBigramFilter to tell it to also output
> unigrams. This would allow indexing of both bigrams and unigrams, and at
> query time the analyzer could analyze queries as bigrams unless the query
> contained a single Han unigram.
>
> As an example, here is a configuration for a Solr fieldType with the
> analyzer for indexing with the "indexUnigrams" flag set and the analyzer
> for querying without the flag.
>
> <fieldType name="CJK" autoGeneratePhraseQueries="false">
>   <analyzer type="index">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" han="true"/>
>   </analyzer>
> </fieldType>
>
> Use case: About 10% of our queries that contain Han characters are single
> character queries. The CJKBigram filter only outputs single characters when
> there are no adjacent bigrammable characters in the input. This means we
> have to create a separate field to index Han unigrams in order to address
> single character queries and then write application code to search that
> separate field if we detect a single character Han query. This is rather
> kludgey.
> With the optional flag, we could configure Solr as above.
>
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter
> used to allow single word queries (although that uses word n-grams rather
> than character n-grams).
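
To make the use case in the issue description concrete, here is a minimal sketch in plain Java of the token streams it describes. It does not use Lucene and is not the CJKBigramFilter implementation: the class BigramSketch and the helper tokens() are hypothetical names invented for this illustration, and the boolean parameter simply mirrors the "indexUnigrams" flag proposed in the issue. The sketch assumes the input is a single run of Han characters, and it emits a lone character as a unigram even without the flag, as the description says the current filter does.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {

    // Hypothetical helper, not Lucene code: emits overlapping bigrams for a
    // run of Han characters, plus every unigram when indexUnigrams is true.
    static List<String> tokens(String hanRun, boolean indexUnigrams) {
        int[] cps = hanRun.codePoints().toArray();
        boolean lone = cps.length == 1; // no adjacent character to pair with
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cps.length; i++) {
            if (indexUnigrams || lone) {
                out.add(new String(cps, i, 1)); // single character
            }
            if (i + 1 < cps.length) {
                out.add(new String(cps, i, 2)); // character plus its successor
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "中华人民";

        // Index analyzer with the flag: unigrams and bigrams.
        List<String> withFlag = tokens(text, true);
        // Index analyzer without the flag: bigrams only.
        List<String> withoutFlag = tokens(text, false);
        // Query analyzer (no flag) on a single-character query.
        List<String> query = tokens("人", false);

        System.out.println(withFlag);    // [中, 中华, 华, 华人, 人, 人民, 民]
        System.out.println(withoutFlag); // [中华, 华人, 人民]
        System.out.println(query);       // [人]

        // The single-character query only finds a matching term when
        // unigrams were indexed alongside the bigrams.
        System.out.println(withFlag.containsAll(query));    // true
        System.out.println(withoutFlag.containsAll(query)); // false
    }
}
```

With bigrams only, the one-character query term has nothing to match in the field, which is why a separate unigram field and extra application code are needed today. With unigrams indexed in the same field, the same query analyzer serves both single-character and multi-character queries.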