[
https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431297#comment-13431297
]
Tom Burton-West commented on LUCENE-4286:
-----------------------------------------
Thanks Robert for all your work on non-English searching and for your quick
response on this issue.
>>If you do unigrams and bigrams in separate fields, you can bias bigrams over
>>unigrams.
That was our original intention.
>>The combined unigram+bigram technique is a general technique, which I think
>>is useful to support. ...Tom would have to do tests for his "index-time-only"
>>approach: I can't speak for that.
Originally I was going to use the combined unigram+bigram technique (with a
boost for the bigram fields) and wrote some custom code to implement it.
However, I started thinking about the size of our documents. With one
exception, all the literature I found that got better results with a
combination of bigrams and unigrams used newswire-sized documents (somewhere
in the range of a few hundred words). Our documents are several orders of
magnitude larger (around 100,000 words).
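For illustration only (this is not the custom code mentioned above), a minimal
sketch of what a combined query with a boost on the bigram field might look
like at query time. The field names han_unigrams and han_bigrams, the boost
value, and the helper names are all hypothetical, and the input is assumed to
be a run of Han characters already isolated by the tokenizer.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combined_query(text,
                   unigram_field="han_unigrams",   # hypothetical field name
                   bigram_field="han_bigrams",     # hypothetical field name
                   bigram_boost=2.0):              # arbitrary example boost
    """Build a Lucene-syntax query string that searches both fields.

    Bigram clauses carry a boost so that bigram matches outweigh
    unigram matches; unigram clauses are left unboosted.
    """
    bigram_clauses = [f"{bigram_field}:{t}^{bigram_boost}"
                      for t in char_ngrams(text, 2)]
    unigram_clauses = [f"{unigram_field}:{t}"
                       for t in char_ngrams(text, 1)]
    return " OR ".join(bigram_clauses + unigram_clauses)
```

For a single-character input there are no bigrams, so the query falls back to
the unigram field alone.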
My understanding is that the main reason adding unigrams to bigrams increases
relevance is that often the unigram will have a related meaning to the larger
word. So using unigrams is somewhat analogous to decompounding or stemming. I
haven't done any tests, but my guess is that with our very large documents the
additional recall added by unigrams will be offset by a decrease in precision.
After I get a test suite set up for relevance ranking in English, I'll take a
look at testing CJK :)
Tom
> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -------------------------------------------------------------------------
>
> Key: LUCENE-4286
> URL: https://issues.apache.org/jira/browse/LUCENE-4286
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 4.0-ALPHA, 3.6.1
> Reporter: Tom Burton-West
> Priority: Minor
> Fix For: 4.0-BETA, 5.0
>
> Attachments: LUCENE-4286.patch, LUCENE-4286.patch
>
>
> Add an optional flag to the CJKBigramFilter to tell it to also output
> unigrams. This would allow indexing of both bigrams and unigrams and at
> query time the analyzer could analyze queries as bigrams unless the query
> contained a single Han unigram.
> As an example, here is a configuration of a Solr fieldType, with the index
> analyzer setting the "indexUnigrams" flag and the query analyzer omitting
> it.
> <fieldType name="CJK" autoGeneratePhraseQueries="false">
>   <analyzer type="index">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" han="true"/>
>   </analyzer>
> </fieldType>
> Use case: About 10% of our queries that contain Han characters are
> single-character queries. The CJKBigramFilter only outputs single
> characters when there are no adjacent bigrammable characters in the input.
> This means we have to create a separate field to index Han unigrams in
> order to address single-character queries, and then write application code
> to search that separate field if we detect a single-character Han query.
> This is rather kludgey. With the optional flag, we could configure Solr as
> above.
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter
> used to allow single word queries (although that uses word n-grams rather
> than character n-grams.)
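
The behaviour requested in the description can be sketched in a few lines.
This is a simplified model, not the CJKBigramFilter implementation: the
function name han_tokens and the index_unigrams parameter are hypothetical,
the input is assumed to be a single run of adjacent Han characters, and the
token order shown is an assumption of the sketch.

```python
def han_tokens(run, index_unigrams=False):
    """Model the tokens emitted for one run of adjacent Han characters.

    A lone character is always emitted as a unigram, since it has no
    neighbour to form a bigram with. Longer runs produce overlapping
    bigrams, plus every unigram when index_unigrams is set.
    """
    if len(run) == 1:
        return [run]
    tokens = []
    for i, ch in enumerate(run):
        if index_unigrams:
            tokens.append(ch)             # unigram for this character
        if i + 1 < len(run):
            tokens.append(run[i:i + 2])   # bigram starting at this character
    return tokens


# Index side (flag set) versus query side (flag not set).
indexed = han_tokens("中国人", index_unigrams=True)
single_char_query = han_tokens("国")
two_char_query = han_tokens("中国")
```

With the flag set at index time, the single-character query token is among
the indexed tokens, so the same field serves both single-character and
multi-character queries. Without the flag, only the bigrams are indexed and
the single-character query finds nothing in that field.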