[ https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509373#comment-13509373 ]
Shawn Heisey commented on LUCENE-4286:
--------------------------------------
I have just tried indexUnigrams="true" on branch_4x, checked out 2012/11/28,
and it doesn't seem to be working. The analysis page (indexing) shows the
bigrams, but no unigrams. Am I doing something wrong?
My fieldType:
{code}
<fieldType name="genText" class="solr.TextField" sortMissingLast="true"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
      pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
      replacement="$2"
      allowempty="false"
    />
    <filter class="solr.WordDelimiterFilterFactory"
      splitOnCaseChange="1"
      splitOnNumerics="1"
      stemEnglishPossessive="1"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="1"
      catenateNumbers="1"
      catenateAll="0"
      preserveOriginal="1"
    />
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true"/>
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
      pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
      replacement="$2"
      allowempty="false"
    />
    <filter class="solr.WordDelimiterFilterFactory"
      splitOnCaseChange="1"
      splitOnNumerics="1"
      stemEnglishPossessive="1"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="0"
      catenateNumbers="0"
      catenateAll="0"
      preserveOriginal="1"
    />
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" indexUnigrams="false"/>
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
</fieldType>
{code}
> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -------------------------------------------------------------------------
>
> Key: LUCENE-4286
> URL: https://issues.apache.org/jira/browse/LUCENE-4286
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 4.0-ALPHA, 3.6.1
> Reporter: Tom Burton-West
> Priority: Minor
> Fix For: 4.0-BETA, 5.0
>
> Attachments: LUCENE-4286.patch, LUCENE-4286.patch,
> LUCENE-4286.patch_3.x
>
>
> Add an optional flag to the CJKBigramFilter to tell it to also output
> unigrams. This would allow indexing of both bigrams and unigrams and at
> query time the analyzer could analyze queries as bigrams unless the query
> contained a single Han unigram.
> As an example, here is a configuration of a Solr fieldType whose index
> analyzer sets the "indexUnigrams" flag and whose query analyzer omits
> the flag.
> <fieldType name="CJK" autoGeneratePhraseQueries="false">
>   <analyzer type="index">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true"
>       han="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" han="true"/>
>   </analyzer>
> </fieldType>
> Use case: About 10% of our queries that contain Han characters are single
> character queries. The CJKBigram filter only outputs single characters when
> there are no adjacent bigrammable characters in the input. This means we
> have to create a separate field to index Han unigrams in order to address
> single character queries and then write application code to search that
> separate field if we detect a single character Han query. This is rather
> kludgy. With the optional flag, we could configure Solr as above.
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter
> used to allow single word queries (although that uses word n-grams rather
> than character n-grams).
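
The single-character problem in the use case above can be illustrated with a small Python model. This is only a sketch of the behaviour the issue describes, not Lucene's implementation: the function name `han_tokens` and the `with_unigrams` parameter are hypothetical, only the basic CJK Unified Ideographs block is treated as Han, and the token order is an assumption of the sketch.

```python
import re

# Assumption: only U+4E00..U+9FFF counts as Han in this simplified model.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def han_tokens(text, with_unigrams=False):
    """Model the tokens a bigramming filter emits for the Han runs in text.

    A run of one character is emitted as a unigram. Longer runs are emitted
    as overlapping bigrams, plus every single character when with_unigrams
    is set (hypothetical stand-in for the flag proposed in the issue).
    """
    tokens = []
    for run in HAN_RUN.findall(text):
        if len(run) == 1:
            # No adjacent character to pair with: fall back to the unigram.
            tokens.append(run)
            continue
        for i, ch in enumerate(run):
            if with_unigrams:
                tokens.append(ch)
            if i + 1 < len(run):
                tokens.append(run[i:i + 2])
    return tokens


indexed_plain = han_tokens("中国人")
indexed_with_unigrams = han_tokens("中国人", with_unigrams=True)
query = han_tokens("中")
```

In this model the single-character query token is missing from `indexed_plain` but present in `indexed_with_unigrams`, which is the gap the separate unigram field currently has to cover.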