RE: frequent terms - Re: combining open office spellchecker with Lucene

Aad Nales Wed, 15 Sep 2004 23:07:06 -0700

Also,

You can also use an alternative spellchecker for the 'checking part' and
use the Ngram algorithm for the 'suggestion' part. Only if the spell
'check' declares a word illegal the 'suggestion' part would perform its
magic.



cheers,
Aad

Doug Cutting wrote:

> David Spencer wrote:
> 
>> [1] The user enters a query like:
>>     recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a
>> term in the index, but the next 2 are. So it ignores the last 2 terms

>> ("recursive" and "descent") and suggests alternatives to 
>> "recursize"...thus if any term is in the index, regardless of 
>> frequency,  it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in
>> the index and thus is sort of spelled correctly ( as it exists in
some 
>> doc), then we use the heuristic that any sufficiently large doc 
>> collection will have tons of misspellings, so we assume that rare 
>> terms in the query might be misspelled (i.e. not what the user 
>> intended) and we suggest alternativies to these words too (in
addition 
>> to the words in the query that are not in the index at all).
> 
> 
> Almost.
> 
> If the user enters "a recursize purser", then: "a", which is in, say,
>  >50% of the documents, is probably spelled correctly and "recursize",

> which is in zero documents, is probably mispelled.  But what about 
> "purser"?  If we run the spell check algorithm on "purser" and
generate 
> "parser", should we show it to the user?  If "purser" occurs in 1% of 
> documents and "parser" occurs in 5%, then we probably should, since 
> "parser" is a more common word than "purser".  But if "parser" only 
> occurs in 1% of the documents and purser occurs in 5%, then we
probably 
> shouldn't bother suggesting "parser".
> 
> If you wanted to get really fancy then you could check how frequently
> combinations of query terms occur, i.e., does "purser" or "parser"
occur 
> more frequently near "descent".  But that gets expensive.

I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.

If true (default) then for common words like "remove", no results are 
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the

page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&max
r=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0

TBD I need to update the javadoc & repost the code I guess. Also as per 
earlier post I also store simple transpositions for words in the 
ngram-index.

-- Dave

> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: frequent terms - Re: combining open office spellchecker with Lucene

Reply via email to