[ https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tom Burton-West updated LUCENE-4286:
------------------------------------
Attachment: LUCENE-4286.patch_3.x
We are still using Solr 3.6 in production, so I backported the patch to
Lucene/Solr 3.6. It is attached as LUCENE-4286.patch_3.x.
> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -------------------------------------------------------------------------
>
> Key: LUCENE-4286
> URL: https://issues.apache.org/jira/browse/LUCENE-4286
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 4.0-ALPHA, 3.6.1
> Reporter: Tom Burton-West
> Priority: Minor
> Fix For: 4.0-BETA, 5.0
>
> Attachments: LUCENE-4286.patch, LUCENE-4286.patch,
> LUCENE-4286.patch_3.x
>
>
> Add an optional flag to the CJKBigramFilter to tell it to also output
> unigrams. This would allow indexing of both bigrams and unigrams. At
> query time, the analyzer could then analyze queries as bigrams unless the
> query contained a single Han unigram.
> As an example, here is a configuration of a Solr fieldType in which the
> index analyzer sets the "indexUnigrams" flag and the query analyzer
> omits it:
> <fieldType name="CJK" autoGeneratePhraseQueries="false">
>   <analyzer type="index">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" han="true"/>
>   </analyzer>
> </fieldType>
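
As a rough illustration of what the two analyzers above would emit for a run of Han characters, here is a minimal Java sketch. It is plain Java and not the actual CJKBigramFilter code: the class and method names are made up for this example, the token order is only illustrative, and it ignores position increments, offsets, token types and non-Han scripts. The sample string is 中华人民, written with Unicode escapes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: mimics the token stream described
// in this issue for a single run of adjacent Han characters.
class BigramUnigramSketch {

    // Emits overlapping bigrams; when indexUnigrams is true, every single
    // character is emitted as well.
    static List<String> tokens(String run, boolean indexUnigrams) {
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cps.length; i++) {
            // A lone character has no neighbour to pair with, so it is
            // emitted as a unigram even without the flag.
            if (indexUnigrams || cps.length == 1) {
                out.add(new String(cps, i, 1));
            }
            if (i + 1 < cps.length) {
                out.add(new String(cps, i, 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String run = "\u4e2d\u534e\u4eba\u6c11";
        System.out.println(tokens(run, true));   // index-time: unigrams and bigrams
        System.out.println(tokens(run, false));  // query-time: bigrams only
        System.out.println(tokens("\u4e2d", false)); // single character: one unigram
    }
}
```

With the flag set, the sketch yields the seven tokens 中, 中华, 华, 华人, 人, 人民, 民. Without it, only the three bigrams are produced, so a single-character query such as 中 finds nothing in a bigram-only index.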
> Use case: About 10% of our queries that contain Han characters are
> single-character queries. The CJKBigramFilter only outputs single characters
> when there are no adjacent bigrammable characters in the input. This means we
> have to create a separate field to index Han unigrams in order to address
> single-character queries and then write application code to search that
> separate field if we detect a single-character Han query. This is rather
> kludgy. With the optional flag, we could simply configure Solr as above.
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter
> used to allow single word queries (although that uses word n-grams rather
> than character n-grams.)
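
For context, the application-side workaround described in the use case might look roughly like the following sketch. It is an assumption about how such routing could be written, not code from the patch or from any production system. The field names `cjk_unigram` and `cjk_bigram` and the class name are hypothetical.

```java
// Hypothetical query router: picks the field to search depending on whether
// the query is a single Han character. Field names are placeholders.
class HanQueryRouter {

    // True when the trimmed query is exactly one code point in the Han script.
    static boolean isSingleHan(String query) {
        String q = query.trim();
        return q.codePointCount(0, q.length()) == 1
                && Character.UnicodeScript.of(q.codePointAt(0))
                        == Character.UnicodeScript.HAN;
    }

    // Single Han characters go to the separate unigram field; everything
    // else goes to the bigram field.
    static String fieldFor(String query) {
        return isSingleHan(query) ? "cjk_unigram" : "cjk_bigram";
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("\u4e2d"));       // single Han character
        System.out.println(fieldFor("\u4e2d\u534e")); // two Han characters
    }
}
```

With the proposed flag, unigrams and bigrams would live in the same field, so this kind of detection and the second field would no longer be needed.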