My understanding is that NGram tokenizing is meant to help with languages that don't necessarily use spaces as word delimiters (Japanese et al.). There, bi-gramming is used to find words within a stream of unbroken characters. In that case, you want a hit to contain all of the bi-grams produced from the search query. An "OR" wouldn't work as well, since you would get tons of hits.
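To make the all-grams versus "OR" difference concrete, here is a minimal Python sketch. It is not Solr code: the `ngrams` helper is hypothetical and only mimics the minGramSize="2"/maxGramSize="15" settings from the question. It also ignores token positions, which a PhraseQuery takes into account as well, so it can only show which grams overlap, not reproduce Solr's exact results.

```python
def ngrams(token, min_size=2, max_size=15):
    """Return every substring of token with length in [min_size, max_size].

    Hypothetical helper for illustration; it is not part of Solr or Lucene.
    """
    grams = set()
    for size in range(min_size, min(max_size, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


indexed = ngrams("dublin")   # grams stored for the indexed token
query = ngrams("duvlin")     # grams produced from the misspelled query

shared = indexed & query     # grams that both sides have in common
all_match = query <= indexed # "all grams required" semantics
any_match = bool(shared)     # "OR" semantics

print(sorted(shared))
print("all grams present:", all_match)
print("any gram present:", any_match)
```

Under these assumptions only a handful of short grams overlap, so requiring every query gram fails for "duvlin", while an "OR" would match it along with any other document that happens to contain one of those short grams.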
-Todd Feak

-----Original Message-----
From: aod...@gmail.com [mailto:aod...@gmail.com]
Sent: Wednesday, September 30, 2009 10:54 AM
To: solr-user@lucene.apache.org
Subject: NGramTokenFilter behaviour

If I index the following text:

"I live in Dublin Ireland where Guinness is brewed"

Then search for: duvlin

Should Solr return a match?

In the admin interface under the analysis section, Solr highlights some NGram matches?

When I enter the following query string into my browser address bar, I get 0 results?

http://localhost:8983/solr/select/?q=duvlin&debugQuery=true

Nor do I get results for dub, dubli, ublin, dublin (du does return a result).

I also notice when I use debugQuery=true, the parsed query is a PhraseQuery. This doesn't make sense to me, as surely the point of the NGram is to use a Boolean OR between each Gram??

However, if I don't use an NGramFilterFactory at query time, I can get results for: dub, ublin, du, but not duvlin.

    <fieldType name="text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
      </analyzer>
    </fieldType>

Can someone please clarify what the purpose of the NGramFilter/tokenizer is, if not to allow for misspellings/morphological variation and also, what the correct configuration is in terms of use at index/query time.

Any help appreciated!

Aodh.

Solr 1.3, JDK 1.6