RE: How to best handle search like Dave & David

Susheel Kumar Mon, 03 Mar 2014 09:55:42 -0800

Thanks, Arun, for sharing the idea on EdgeNGramFilter. In our case we are doing 
search using an automated process, so EdgeNGramFilter may not work. We have 
used NGramFilterFactory in the past but will look into it again.

For cases like Dave & David and other English names, does anyone have an idea 
which stemmer (we are currently using PorterStemFilterFactory) would work better?

-----Original Message-----
From: Arun Rangarajan [mailto:arunrangara...@gmail.com] 
Sent: Sunday, March 02, 2014 1:47 PM
To: solr-user@lucene.apache.org
Subject: Re: How to best handle search like Dave & David

If you are trying to serve results as users are typing, then you can use 
EdgeNGramFilter (see 
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
).

Let's say you configure your field like this, as shown in the Solr wiki:

<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
   </analyzer>
</fieldType>

Then this is what happens at index time for your tokens:

David ---> | LowerCaseTokenizerFactory | ---> david ---> | EdgeNGramFilterFactory | ---> da dav davi david
Dave ---> | LowerCaseTokenizerFactory | ---> dave ---> | EdgeNGramFilterFactory | ---> da dav dave

And at query time, when your user enters 'Dav' it will match both those tokens. 
Note that the moment your user starts typing more, say 'davi' it won't match 
'Dave' since you are doing edge N gramming only at index time and not at query 
time. You can also do edge N gramming at query time if you want 'Dave' to match 
'David', probably keeping a larger minGramSize (in this case 3) to avoid noise 
(like say 'Dave' matching 'Dana' though with a lower score), but it will be 
expensive to do n-gramming at query time.
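
For the query-time variant described above, a minimal sketch could look like the 
following. It is not taken from the wiki: the field type name is hypothetical, 
and minGramSize is raised to 3 in both analyzers as an assumption, following the 
suggestion above.

<fieldType name="text_name_edge_ngram_both" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <!-- index-time grams: david -> dav davi david, dave -> dav dave -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <!-- query terms are edge n-grammed as well: dave -> dav dave -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
   </analyzer>
</fieldType>

With this setup, a query for 'Dave' produces dav and dave, and dav is also among 
the index-time grams of 'David', so the two can match, assuming a match on any 
one of the query grams is enough. 'Dana' only produces dan and dana, so it shares 
no gram with 'Dave'.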

On Fri, Feb 28, 2014 at 3:22 PM, Susheel Kumar < 
susheel.ku...@thedigitalgroup.net> wrote:

> Hi,
>
> We have name searches on Solr for millions of documents. One user may 
> search for "Morrison Dave" while another may search for "Morrison David".  
> What's the best way to make both bring back similar results? Adding 
> synonyms is the option we are using right now.
>
> But we may need to add around 50,000+ such synonyms for different 
> names. For each specific name there can be a couple of synonyms; for 
> Richard, for example, it can be Rich, Rick, Richie, etc.
>
> Any experience adding so many synonyms, or any other thoughts? Stemming 
> may help in a few situations but not in cases like Dave and David.
>
> Thanks,
> Susheel
>
