Re: Spell checking ?'s

Sean Timm Mon, 25 Feb 2008 08:22:27 -0800

As I don't work for Google, I can only guess. :-) When they think that you have spelled something incorrectly, they seem to also search for what they deem to be the correct spelling. In this particular case, there are two "Abdur Chowdhury's" of some fame. One is the IR scientist, the other is a published economist.

If you make it a phrase query: "abdur chowdhury" vs. "abdur choudhury", there are 8,240 hits for the former versus 26 hits for the latter.


-Sean

Otis Gospodnetic wrote:

Aha, good example, Sean.  What's the explanation?  Note that doing:
    http://www.google.com/search?q=abdur+choudhury
offers this alternative:
    http://www.google.com/search?q=abdur+chowdhury

And that the number of hits is approximately the same in both cases and that 
Google is smart enough to search for and highlight chowdhury even when the 
search was for choudhury.

Google's spelling corrections/suggestions are driven off of massive query 
(refinement) logs.  Solr's suggestions are based on the index field content.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Sean Timm <[EMAIL PROTECTED]>
To: solr-dev@lucene.apache.org
Sent: Friday, February 22, 2008 4:03:58 PM
Subject: Re: Spell checking ?'s
Sometimes context can play into the correct spelling of a term. I haven't looked at the 1.3 spell check stuff, but it would be nice to do term n-gramming in order to check the terms in context.
Since Otis brought up Google, here is an example of putting the term into context.
http://www.google.com/search?q=choudhury
http://www.google.com/search?q=abdur+choudhury
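
A minimal sketch of the term n-gram idea, assuming word bigrams counted over indexed text: a candidate spelling is preferred when it has been seen after the preceding term. The corpus below is invented toy data, and `build_counts` and `best_in_context` are hypothetical names for illustration, not part of Solr or Lucene.

```python
from collections import Counter

# Invented toy documents standing in for indexed field content (an assumption).
CORPUS = [
    "abdur chowdhury information retrieval",
    "abdur chowdhury search engines",
    "abdur chowdhury published economist",
    "roy choudhury physics",
    "roy choudhury lecture notes",
]


def build_counts(docs):
    """Count single terms and adjacent term pairs (bigrams) across all documents."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        terms = doc.split()
        unigrams.update(terms)
        bigrams.update(zip(terms, terms[1:]))
    return unigrams, bigrams


def best_in_context(previous, candidates, unigrams, bigrams):
    """Pick the candidate seen most often after `previous`.

    Ties (including no bigram evidence at all) fall back to plain term frequency.
    """
    return max(candidates, key=lambda c: (bigrams[(previous, c)], unigrams[c]))


unigrams, bigrams = build_counts(CORPUS)
candidates = ["choudhury", "chowdhury"]
print(best_in_context("abdur", candidates, unigrams, bigrams))
print(best_in_context("roy", candidates, unigrams, bigrams))
```

With this toy data the same two candidates resolve differently depending on the preceding term, which is the behaviour a per-term check in isolation cannot give.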

-Sean

Otis Gospodnetic wrote:
Haven't used SCRH in a while, but what you are describing sounds right (thinking about how Google does it) - each word should be checked separately and we shouldn't assume splitting on whitespace. I'm trying to think if there are cases where you'd want to look at the surrounding terms instead of looking at each term in isolation.... can't think of anything exciting.... maybe ensure that words with dashes are properly handled.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Grant Ingersoll
To: solr-dev@lucene.apache.org
Sent: Thursday, February 21, 2008 3:13:20 PM
Subject: Spell checking ?'s

Hi,
I've been looking a bit at the spell checker and the implementation in the SpellCheckerRequestHandler and I have some questions.
In looking at the code and the wiki, the SpellChecker seems to treat multiword queries differently depending on whether extendedResults is true or not. Is the use case a multiword query or a single word query? It seems like one would want to pass the whole query to the spell checker and have it come back with results for each word, by default. Otherwise, the application would need to do the tokenization and send each term one by one to the spell checker. However, the app likely doesn't have access to the spell check tokenizer, so this is difficult.
Which leads me to the next question, in the extendedResults, shouldn't it use the Query analyzer for the spellcheck field to tokenize the terms instead of splitting on the space character?
Would it make sense to, for extendedResults anyway, do the following:
Tokenize the query using the query analyzer for the spelling field
for each token
    spell check the token
    add the results
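
The steps above could be sketched roughly as follows. This is illustrative Python, not the SpellCheckerRequestHandler code: `analyze` is a hypothetical stand-in for the spelling field's query analyzer, `DICTIONARY` is a toy term list standing in for the spelling index, and `difflib.get_close_matches` is swapped in for the real spell checker's similarity ranking.

```python
import difflib
import re

# Toy term list standing in for the spelling index (an assumption).
DICTIONARY = ["spell", "checker", "request", "handler", "query"]


def analyze(query):
    """Hypothetical analyzer: lowercase and split on any non-alphanumeric run,
    so dashes and punctuation are handled, not just spaces."""
    return re.findall(r"[a-z0-9]+", query.lower())


def spell_check_query(query, dictionary=DICTIONARY, max_suggestions=3):
    """Tokenize the whole query, spell check each token, and collect the results."""
    results = []
    for token in analyze(query):
        if token in dictionary:
            suggestions = []  # token is known, nothing to suggest
        else:
            suggestions = difflib.get_close_matches(
                token, dictionary, n=max_suggestions, cutoff=0.6
            )
        results.append((token, suggestions))
    return results


for token, suggestions in spell_check_query("Spel-chekker query"):
    print(token, suggestions)
```

The point of the sketch is that the caller passes the whole query once and gets per-token results back, so the application never needs its own copy of the tokenizer.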
I see that extendedResults is a 1.3 addition, so we would be fine to change it, if it makes sense.
Perhaps, for back compatibility, we keep the existing way for non-extendedResults. However, it seems like multiword queries should be split even in the non-extended results, but I am not sure. How are others using it?
Thanks,
Grant
