[jira] [Commented] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams

Shawn Heisey (JIRA) Mon, 03 Dec 2012 16:28:00 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509373#comment-13509373 ]


Shawn Heisey commented on LUCENE-4286:
--------------------------------------

I just tried indexUnigrams="true" on branch_4x (checked out 2012/11/28) and it 
doesn't seem to work.  The analysis page (index side) shows the bigrams, but 
no unigrams.  Am I doing something wrong?

my fieldType:

{code}
    <fieldType name="genText" class="solr.TextField" sortMissingLast="true"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
          allowempty="false"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="1" max="512"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
          allowempty="false"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="1"
        />
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" indexUnigrams="false"/>
        <filter class="solr.LengthFilterFactory" min="1" max="512"/>
      </analyzer>
    </fieldType>
{code}

                
> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-4286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4286
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.0-ALPHA, 3.6.1
>            Reporter: Tom Burton-West
>            Priority: Minor
>             Fix For: 4.0-BETA, 5.0
>
>         Attachments: LUCENE-4286.patch, LUCENE-4286.patch, 
> LUCENE-4286.patch_3.x
>
>
> Add an optional flag to the CJKBigramFilter to tell it to also output 
> unigrams. This would allow indexing both bigrams and unigrams; at query 
> time the analyzer could analyze queries as bigrams unless the query 
> contained a single Han unigram.
> As an example, here is a configuration for a Solr fieldType whose index 
> analyzer sets the "indexUnigrams" flag and whose query analyzer does not:
> <fieldType name="CJK" autoGeneratePhraseQueries="false">
>   <analyzer type="index">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" han="true"/>
>   </analyzer>
> </fieldType>
> Use case: About 10% of our queries that contain Han characters are 
> single-character queries. The CJKBigramFilter only outputs single characters 
> when there are no adjacent bigrammable characters in the input. This means 
> we have to create a separate field to index Han unigrams in order to handle 
> single-character queries, and then write application code to search that 
> separate field when we detect a single-character Han query. This is rather 
> kludgy. With the optional flag, we could configure Solr as above.
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter 
> used to allow single-word queries (although that uses word n-grams rather 
> than character n-grams).
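
For illustration only, the behaviour described in the issue can be modelled with a small toy function. This is a sketch under stated assumptions, not Lucene's implementation: the names `is_han` and `cjk_tokens` and the `index_unigrams` parameter are hypothetical, "Han" is approximated by the CJK Unified Ideographs block only, and positions and offsets are ignored.

```python
def is_han(ch):
    # Assumption: approximate "Han" by the CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def cjk_tokens(text, index_unigrams=False):
    """Toy model of bigram (and optional unigram) output for Han runs."""
    tokens = []
    run = []  # current run of adjacent Han characters

    def flush():
        if len(run) == 1:
            # An isolated character has no neighbour to pair with,
            # so it is emitted as a unigram regardless of the flag.
            tokens.append(run[0])
        else:
            for i, ch in enumerate(run):
                if index_unigrams:
                    tokens.append(ch)               # extra unigram
                if i + 1 < len(run):
                    tokens.append(ch + run[i + 1])  # overlapping bigram
        run.clear()

    for ch in text:
        if is_han(ch):
            run.append(ch)
        else:
            flush()  # any non-Han character ends the current run
    flush()
    return tokens


print(cjk_tokens("一二三"))                       # ['一二', '二三']
print(cjk_tokens("一二三", index_unigrams=True))  # ['一', '一二', '二', '二三', '三']
print(cjk_tokens("一"))                           # ['一']
```

In this model a single-character query such as "一" only matches text indexed with the flag set, which is the gap the issue description wants to close without a separate unigram field.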
