Re: Search for misspelled words in corpus

Otis Gospodnetic Sat, 08 Jun 2013 23:01:26 -0700

Interesting problem.  The first thing that comes to mind is to do
"word expansion" during indexing, similar to synonym expansion but
a bit more dynamic.  If you have a dictionary of correctly spelled
words, then for each token emitted by the tokenizer you could look it
up in the dictionary and expand it to all dictionary words that are
similar/close enough.  This would not be super fast, and you'd likely
need a custom heuristic to define what "similar/close enough" means,
but it might work.
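
A rough sketch of that expansion step, in plain Python rather than as
a real Lucene/Solr token filter.  Everything here is an assumption
made for illustration: the names osa_distance, expand_token and
DICTIONARY are hypothetical, and the heuristic (leave correctly
spelled tokens alone, expand unknown tokens to dictionary words within
one edit, counting a swap of adjacent letters as one edit) is just one
possible reading of "similar/close enough".

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    prev2 = []                          # row i-2 of the DP table
    prev = list(range(len(b) + 1))      # row i-1 of the DP table
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def expand_token(token, dictionary, max_distance=1):
    """Return the terms to index for one token (assumed heuristic)."""
    if token in dictionary:
        # Correctly spelled words are indexed as-is, so "sight" is
        # never expanded to "fight".
        return [token]
    close = sorted(w for w in dictionary
                   if osa_distance(token, w) <= max_distance)
    return [token] + close


# Hypothetical toy dictionary of correctly spelled words.
DICTIONARY = {"fight", "sight", "light"}

for tok in ["figth", "feight", "sight"]:
    print(tok, "->", expand_token(tok, DICTIONARY))
```

With this heuristic a query for "fight" would match documents
containing "figth" or "feight", because those tokens are also indexed
as "fight", while "sight" and "light" stay untouched.  The linear scan
over the dictionary is the slow part; a real implementation would need
something smarter.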


I'd love to hear other ideas...

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
<kamesh...@gmail.com> wrote:
> Hi,
>
> I have a problem where the text corpus that we need to search
> contains many misspelled words. The same word could also be misspelled in
> several different ways. The corpus could also have documents with the
> correct spellings. However, the search term that we give in the query
> would always be spelled correctly. Now when we search on a term, we would
> like to get all the documents that contain both correct and misspelled
> forms of the search term.
> We tried fuzzy search, but it doesn't work as we expected. It
> returns any close match, not specifically misspelled words. For example, if
> I'm searching for a word like "fight", I would like to return the documents
> that have words like "figth" and "feight", not documents with words like
> "sight" and "light".
> Is there any suggested approach for doing this?
>
> regards,
> Kamesh
