RE: NGramTokenFilter behaviour

Feak, Todd Wed, 30 Sep 2009 11:22:33 -0700

My understanding is that n-gram tokenizing is meant to help with languages that 
don't necessarily use spaces as a word delimiter (Japanese et al.). There, 
bi-gramming is used to find words contained within a stream of unbroken 
characters, so you want a match to contain all of the bi-grams produced from 
the search query. An "OR" wouldn't work as well, as you would get tons of 
hits.
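
To make the AND-versus-OR point concrete, here is a minimal Python sketch. It
is not Solr code: the helpers ngrams, match_all and match_any are hypothetical,
and it ignores gram positions, which a real phrase query would also check.

```python
def ngrams(text, n=2):
    """Return the set of character n-grams of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def match_all(query, doc, n=2):
    """AND semantics: every gram of the query must occur in the document."""
    return ngrams(query, n) <= ngrams(doc, n)


def match_any(query, doc, n=2):
    """OR semantics: a single shared gram is enough to count as a hit."""
    return bool(ngrams(query, n) & ngrams(doc, n))


# Hypothetical one-word "documents", used only for illustration.
docs = ["dublin", "berlin", "london", "lincoln"]

print([d for d in docs if match_all("dublin", d)])
print([d for d in docs if match_any("dublin", d)])
```

With OR semantics, any word that happens to share a bi-gram such as "li" or
"in" with the query counts as a hit, which is where the tons of hits come from.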

-Todd Feak

-----Original Message-----
From: aod...@gmail.com [mailto:aod...@gmail.com] 
Sent: Wednesday, September 30, 2009 10:54 AM
To: solr-user@lucene.apache.org
Subject: NGramTokenFilter behaviour

If I index the following text: "I live in Dublin Ireland where
Guinness is brewed"

Then search for: duvlin

Should Solr return a match?

In the analysis section of the admin interface, Solr highlights
some NGram matches.

When I enter the following query string into my browser address bar, I
get 0 results:

http://localhost:8983/solr/select/?q=duvlin&debugQuery=true

Nor do I get results for dub, dubli, ublin, dublin (du does return a result).

I also notice that when I use debugQuery=true, the parsed query is a
PhraseQuery. This doesn't make sense to me, as surely the point of the
NGram is to use a Boolean OR between the grams?

However, if I don't use an NGramFilterFactory at query time, I can get
results for: dub, ublin, du, but not duvlin.
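
As a rough illustration of that last observation, here is a Python sketch. It
is not Solr code: all_ngrams is a hypothetical helper, and the sketch assumes
the index-time filter emits every substring of 2 to 15 characters of each
lowercased token.

```python
def all_ngrams(token, min_n=2, max_n=15):
    """Return every substring of token whose length is in [min_n, max_n]."""
    grams = set()
    for n in range(min_n, min(max_n, len(token)) + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


# Grams assumed to be indexed for the lowercased token "dublin".
indexed = all_ngrams("dublin")

# With no NGram filter at query time, the query stays a single term,
# so it can only match if it is itself one of the indexed grams.
for query in ["du", "dub", "ublin", "duvlin"]:
    print(query, query in indexed)
```

Under that assumption, dub, ublin and du are all substrings of "dublin" and so
are present as indexed terms, while duvlin is not a substring and cannot match.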

<fieldType name="text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
      </analyzer>
</fieldType>

Can someone please clarify what the purpose of the
NGramFilter/tokenizer is, if not to allow for
misspellings/morphological variation, and also what the correct
configuration is in terms of use at index/query time?

Any help appreciated!

Aodh.

Solr 1.3, JDK 1.6
