Re: EdgeNGram relevancy

Robert Gründler Thu, 11 Nov 2010 12:51:37 -0800

according to the fieldtype i posted previously, i think it's because of:

1. WhitespaceTokenizer splits the String "Clyde Phillips" into 2 tokens: 
"Clyde" and "Phillips"
2. EdgeNGramFilter gets the 2 tokens, and creates EdgeNGrams for each token: 
"C" "Cl" "Cly" ...   AND  "P" "Ph" "Phi" ...


The Query String "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by 
the WhitespaceTokenizer.

This creates a match between the 2nd token "Cl" of the query and one of the 
"sub"tokens the EdgeNGramFilter created for "Clyde": "Cl".

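To make the steps above concrete, here is a rough Python sketch of the two 
analyzers. It is not Solr code, just a simulation under a few assumptions: 
the stop filter is left out (the contents of stopwords.txt are not shown in 
the thread), empty tokens are dropped, and the function names are made up for 
illustration. Tokens come out lowercased because the LowerCaseFilter runs 
before the EdgeNGramFilter.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=25):
    """Front-edge n-grams of one token, e.g. "cly" -> ["c", "cl", "cly"]."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text, index_time):
    """Hypothetical stand-in for the edgytext analyzers (stop filter omitted)."""
    # Whitespace tokenizer, lowercase filter, then strip everything but a-z.
    tokens = [re.sub(r"[^a-z]", "", t.lower()) for t in text.split()]
    tokens = [t for t in tokens if t]  # simplification: drop empty tokens
    if index_time:
        # Only the index analyzer contains the EdgeNGramFilter.
        return {gram for t in tokens for gram in edge_ngrams(t)}
    return set(tokens)


names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]
query_terms = analyze("Bill Cl", index_time=False)

# For each name, list the query terms that also exist as indexed terms.
matches = {
    name: sorted(query_terms & analyze(name, index_time=True))
    for name in names
}

for name, hits in matches.items():
    print(name, hits)
```

In this simulation every name shares the term "cl" with the query, and only 
"Bill Clinton" also shares "bill". If a single matching term is enough for a 
document to be returned, that would explain why all four names show up.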

-robert




On Nov 11, 2010, at 21:34 , Andy wrote:

> Could anyone help me understand why "Clyde Phillips" appears in the 
> results for "Bill Cl"?
> 
> "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so 
> why is it even in the results?
> 
> Thanks.
> 
> --- On Thu, 11/11/10, Ahmet Arslan <iori...@yahoo.com> wrote:
> 
>> You can add an additional field that uses
>> KeywordTokenizerFactory instead of
>> WhitespaceTokenizerFactory, and query both fields with
>> an OR operator. 
>> 
>> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
>> 
>> You can even apply a boost so that begins-with matches come
>> first.
>> 
>> --- On Thu, 11/11/10, Robert Gründler <rob...@dubture.com>
>> wrote:
>> 
>>> From: Robert Gründler <rob...@dubture.com>
>>> Subject: EdgeNGram relevancy
>>> To: solr-user@lucene.apache.org
>>> Date: Thursday, November 11, 2010, 5:51 PM
>>> Hi,
>>> 
>>> consider the following fieldtype (used for
>>> autocompletion):
>>> 
>>>   <fieldType name="edgytext" class="solr.TextField"
>>>              positionIncrementGap="100">
>>>     <analyzer type="index">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>               words="stopwords.txt" enablePositionIncrements="true" />
>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>               pattern="([^a-z])" replacement="" replace="all" />
>>>       <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>>>               maxGramSize="25" />
>>>     </analyzer>
>>>     <analyzer type="query">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>               words="stopwords.txt" enablePositionIncrements="true" />
>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>               pattern="([^a-z])" replacement="" replace="all" />
>>>     </analyzer>
>>>   </fieldType>
>>> 
>>> 
>>> This works fine as long as the query string is a single
>>> word. For multiple words, the ranking is weird though.
>>> 
>>> Example:
>>> 
>>> Query String: "Bill Cl"
>>> 
>>> Result (in that order):
>>> 
>>> - Clyde Phillips
>>> - Clay Rogers
>>> - Roger Cloud
>>> - Bill Clinton
>>> 
>>> "Bill Clinton" should have the highest rank in that
>>> case.  
>>> 
>>> Does anyone have an idea how to configure this fieldtype to
>>> make matches in both tokens rank higher than those that
>>> match in only one token?
>>> 
>>> 
>>> thanks!
>>> 
>>> 
>>> -robert
