Character n-grams might help to narrow down the candidates, followed by an edit-distance-style similarity metric such as Jaro-Winkler or Smith-Waterman-Gotoh to filter them further.
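
To make that a bit more concrete, here is a rough sketch in plain Python (not Solr or Lucene code). The function names, the bigram size, and the two thresholds (0.3 and 0.9) are my own assumptions, picked to suit the "fight" example from the original question; they would need tuning on real data. It uses Jaccard overlap of character bigrams as a cheap first pass and a hand-written Jaro-Winkler as the second filter. Smith-Waterman-Gotoh is not shown.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a word (no padding)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Jaccard overlap of the two words' character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unmatched equal character within the match window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def likely_misspellings(term, vocabulary, min_overlap=0.3, min_similarity=0.9):
    """Two-stage filter; both thresholds are illustrative assumptions."""
    candidates = [w for w in vocabulary if ngram_overlap(term, w) >= min_overlap]
    return [w for w in candidates if jaro_winkler(term, w) >= min_similarity]


if __name__ == "__main__":
    vocab = ["fight", "figth", "feight", "sight", "light", "search"]
    # Keeps fight, figth, feight; drops sight, light, search.
    print(likely_misspellings("fight", vocab))
```

Note that the bigram pass on its own scores "sight" higher than "figth", so it is only useful for discarding clearly unrelated terms such as "search"; the second metric does the real separation here.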

On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Interesting problem. The first thing that comes to mind is to do
> "word expansion" during indexing. Kind of like synonym expansion, but
> maybe a bit more dynamic. If you can have a dictionary of correctly
> spelled words, then for each token emitted by the tokenizer you could
> look up the dictionary and expand the token to all other words that
> are similar/close enough. This would not be super fast, and you'd
> likely have to add some custom heuristic for figuring out what
> "similar/close enough" means, but it might work.
>
> I'd love to hear other ideas...
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
> On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
> <kamesh...@gmail.com> wrote:
> > Hi,
> >
> > I have a problem where our text corpus on which we need to do search
> > contains many misspelled words. Same word could also be misspelled in
> > several different ways. It could also have documents that have correct
> > spellings. However, the search term that we give in query would always be
> > correct spelling. Now when we search on a term, we would like to get all
> > the documents that contain both correct and misspelled forms of the
> > search term.
> > We tried fuzzy search, but it doesn't work as per our expectations. It
> > returns any close match, not specifically misspelled words. For example,
> > if I'm searching for a word like "fight", I would like to return the
> > documents that have words like "figth" and "feight", not documents with
> > words like "sight" and "light".
> > Is there any suggested approach for doing this?
> >
> > regards,
> > Kamesh