[jira] [Updated] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams

Tom Burton-West (JIRA) Thu, 29 Nov 2012 10:04:59 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tom Burton-West updated LUCENE-4286:
------------------------------------

    Attachment: LUCENE-4286.patch_3.x

We are still using Solr 3.6 in production, so I backported the patch to 
Lucene/Solr 3.6. It is attached as LUCENE-4286.patch_3.x.
                
> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-4286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4286
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.0-ALPHA, 3.6.1
>            Reporter: Tom Burton-West
>            Priority: Minor
>             Fix For: 4.0-BETA, 5.0
>
>         Attachments: LUCENE-4286.patch, LUCENE-4286.patch, 
> LUCENE-4286.patch_3.x
>
>
> Add an optional flag to the CJKBigramFilter to tell it to also output 
> unigrams. This would allow indexing both bigrams and unigrams. At query 
> time, the analyzer could analyze queries as bigrams unless the query 
> contained a single Han unigram.
> As an example, here is a configuration of a Solr fieldType in which the 
> index analyzer sets the "indexUnigrams" flag and the query analyzer 
> does not.
> <fieldType name="CJK" class="solr.TextField" autoGeneratePhraseQueries="false">
> <analyzer type="index">
>    <tokenizer class="solr.ICUTokenizerFactory"/>
>    <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
> </analyzer>
> <analyzer type="query">
>    <tokenizer class="solr.ICUTokenizerFactory"/>
>    <filter class="solr.CJKBigramFilterFactory" han="true"/>
> </analyzer>
> </fieldType>
> Use case: About 10% of our queries that contain Han characters are 
> single-character queries. The CJKBigramFilter only outputs single 
> characters when there are no adjacent bigrammable characters in the input. 
> This means we have to create a separate field to index Han unigrams in 
> order to handle single-character queries, and then write application code 
> to search that separate field when we detect a single-character Han query. 
> This is rather kludgy. With the optional flag, we could configure Solr as 
> above.
> This is somewhat analogous to the flags in LUCENE-1370 for the 
> ShingleFilter, which allow single-word queries (although that filter uses 
> word n-grams rather than character n-grams).
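
To make the proposed behaviour concrete, here is a minimal Python sketch of 
the idea. It is not the Lucene implementation: the function name han_ngrams 
and its index_unigrams parameter are hypothetical, the input is assumed to be 
a single run of adjacent Han characters, and the token order is only 
illustrative.

```python
def han_ngrams(run, index_unigrams=False):
    """Tokens for one run of adjacent Han characters (simplified model)."""
    chars = list(run)
    if len(chars) == 1:
        # A lone character has no neighbour to pair with, so it is emitted as is.
        return chars
    # Overlapping bigrams: each character paired with its successor.
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    if not index_unigrams:
        return bigrams
    # With the flag set, single characters are emitted alongside the bigrams.
    return chars + bigrams


index_tokens = set(han_ngrams("大学生", index_unigrams=True))
bigram_only_tokens = set(han_ngrams("大学生"))
query_tokens = han_ngrams("学")  # a single-character query

# The single-character query can match only when unigrams were indexed.
print(all(t in index_tokens for t in query_tokens))
print(all(t in bigram_only_tokens for t in query_tokens))
```

Under these assumptions the first check prints True and the second prints 
False, which is the gap the separate unigram field currently has to cover.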
