Re: Search for misspelled words in corpus

Shashi Kant Sat, 08 Jun 2013 23:04:26 -0700

n-grams might help to find candidate terms, followed by an edit distance
metric such as Jaro-Winkler or Smith-Waterman-Gotoh to filter out the
false matches.
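
A rough sketch of that two-step idea in Python, in case it helps. It
assumes character bigrams for the n-gram step and a hand-rolled
Jaro-Winkler for the filter; the function names, the min_shared and
threshold values, and the known_good parameter are all made up for
illustration and would need tuning on real data. It is not how Lucene or
Solr implement fuzzy matching.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word (bigrams by default)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def misspelling_candidates(term, vocabulary, min_shared=2, threshold=0.9,
                           known_good=frozenset()):
    """Step 1: keep words sharing enough bigrams with the term.
    Step 2: keep only those whose Jaro-Winkler score passes the threshold.
    Words listed in known_good (correctly spelled words) are skipped."""
    term_grams = char_ngrams(term)
    hits = []
    for word in vocabulary:
        if word == term or word in known_good:
            continue
        if len(term_grams & char_ngrams(word)) < min_shared:
            continue
        if jaro_winkler(term, word) >= threshold:
            hits.append(word)
    return hits


print(misspelling_candidates("fight", ["figth", "feight", "sight", "light"]))
# prints ['figth', 'feight']
```

The prefix boost in Jaro-Winkler is what separates the two groups here:
"sight" and "light" differ from "fight" in the first letter and score
about 0.87, while "figth" and "feight" keep the leading "f" and score
about 0.95. A correctly spelled neighbour such as "flight" also scores
about 0.95, so a dictionary of valid words (passed as known_good in this
sketch) would still be needed to exclude those.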



On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Interesting problem.  The first thing that comes to mind is to do
> "word expansion" during indexing.  Kind of like synonym expansion, but
> maybe a bit more dynamic. If you can have a dictionary of correctly
> spelled words, then for each token emitted by the tokenizer you could
> look up the dictionary and expand the token to all other words that
> are similar/close enough.  This would not be super fast, and you'd
> likely have to add some custom heuristic for figuring out what
> "similar/close enough" means, but it might work.
>
> I'd love to hear other ideas...
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
> On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
> <kamesh...@gmail.com> wrote:
> > Hi,
> >
> > I have a problem where the text corpus we need to search contains many
> > misspelled words. The same word may be misspelled in several different
> > ways, and some documents also contain the correct spelling. However,
> > the search term given in the query will always be spelled correctly.
> > When we search on a term, we would like to get all the documents that
> > contain either the correct or a misspelled form of the search term.
> > We tried fuzzy search, but it doesn't work the way we expected. It
> > returns any close match, not specifically misspelled words. For
> > example, if I'm searching for a word like "fight", I would like to get
> > back the documents that have words like "figth" and "feight", not
> > documents with words like "sight" and "light".
> > Is there any suggested approach for doing this?
> >
> > regards,
> > Kamesh
>
