NGram for misspelt words

2012-07-18 Thread Husain, Yavar



I have configured NGram indexing for some fields.

Say I search for the city Ludlow: I get the results (normal search).

If I search for Ludlo (with the w omitted), I get the results.

If I search for Ludl (with ow omitted), I still get the results.

I know that these are all partial strings of the main string, which is why 
NGram works perfectly.

But when I type in Ludlwo (misspelt, with the characters o and w 
interchanged), I don't get any results. Ideally it should match Ludl and 
return the results.

I am not looking for edit-distance-based spell correctors. How can I make the 
above NGram-based search work?

Here is my schema.xml (NGramFieldType):

<fieldType name="nGram" class="solr.TextField" positionIncrementGap="100"
           stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- potentially word delimiter, synonym filter, stop words, NOT stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"
            side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- potentially word delimiter, synonym filter, stop words, NOT stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>




Re: NGram for misspelt words

2012-07-18 Thread Dikchant Sahi
You are creating grams only while indexing and not while querying, hence
'ludlwo' will not match. For 'ludlow', your index analyzer creates the
following grams: lu, lud, ludl, ludlo, ludlow. None of these equals the query
term 'ludlwo'.

Either create grams while querying as well, or use edit distance.
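
The point above can be checked with a minimal sketch in plain Python. This is a simulation, not Solr code: the helper name edge_ngrams is hypothetical, and it only mimics front edge grams with minGramSize=2 and maxGramSize=15 after lowercasing, as in the posted fieldType.

```python
def edge_ngrams(token, min_size=2, max_size=15):
    """Return the front-anchored grams of a lowercased token, shortest first.

    Hypothetical helper that imitates EdgeNGramFilterFactory with side=front.
    """
    token = token.lower()
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]


# Grams stored in the index for the city name.
indexed = edge_ngrams("Ludlow")
print(indexed)

# The posted query analyzer only lowercases, so the whole query term
# must equal one of the indexed grams.
for query in ("Ludlow", "Ludlo", "Ludl", "Ludlwo"):
    print(query, query.lower() in indexed)

# If the query were edge-grammed as well, some grams would coincide.
shared = [g for g in edge_ngrams("Ludlwo") if g in indexed]
print(shared)
```

Under these assumptions, only the misspelt 'ludlwo' fails the whole-term check, while gramming the query too leaves lu, lud and ludl in common with the index. Whether Solr then returns the document depends on how the query parser combines those grams, which this sketch does not model.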




RE: NGram for misspelt words

2012-07-18 Thread Husain, Yavar
Thanks Sahi. I have replaced EdgeNGramFilterFactory with NGramFilterFactory, 
as I need substrings from anywhere in the word, not just from the front or back.
You are right. I have now put the same NGramFilterFactory in both the query and 
index analyzers; however, it now returns no results at all, not even for the 
basic search.
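
For reference, here is a rough Python sketch of what gramming both sides would produce for the two spellings. It is not Solr code: the helper name ngrams is hypothetical, and the sizes 2 and 15 are carried over from the earlier EdgeNGram configuration as an assumption, since the new NGramFilterFactory settings are not shown in this thread.

```python
def ngrams(token, min_size=2, max_size=15):
    """Return every substring of the lowercased token with length in range.

    Hypothetical helper that imitates gramming anywhere in the word.
    """
    token = token.lower()
    upper = min(max_size, len(token))
    return [
        token[i:i + n]
        for n in range(min_size, upper + 1)
        for i in range(len(token) - n + 1)
    ]


index_grams = set(ngrams("Ludlow"))
query_grams = set(ngrams("Ludlwo"))

# Grams that both spellings have in common, shortest first.
shared = sorted(index_grams & query_grams, key=lambda g: (len(g), g))
print(shared)
```

The two spellings share the grams dl, lu, ud, lud, udl and ludl, so an overlap exists in principle. This does not explain the empty result set reported above; it only shows which terms each analyzer would be expected to emit under the stated assumptions.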



Re: NGram for misspelt words

2012-07-18 Thread Dikchant Sahi
Have you tried the analysis window to debug this?

I believe something is wrong in the fieldType.
