Another, more theoretical, answer to this question is an n-gram approach. Index each word along with its trigrams. Then query the index by the search string as well as its trigrams, using a percentage-match search, so that a hit only needs to share some fraction of the query's trigrams. You then pass the resulting broad candidate set through a more expensive scoring step such as Smith-Waterman.
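A rough sketch of that two-pass idea, in plain Python rather than against any particular search engine. The function names, the "^"/"$" padding, the 0.3 match threshold, and the alignment weights are all assumptions chosen for illustration; a real setup would use the engine's own n-gram analysis and a tuned scorer.

```python
from collections import defaultdict


def trigrams(word):
    """Return the set of character trigrams of a word, padded at both ends."""
    padded = f"^{word}$"  # padding markers are an illustrative assumption
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(words):
    """Map each trigram to the set of indexed words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in trigrams(word):
            index[gram].add(word)
    return index


def candidates(index, query, min_match=0.3):
    """Cheap first pass: words sharing at least min_match of the query's trigrams."""
    grams = trigrams(query)
    hits = defaultdict(int)
    for gram in grams:
        for word in index.get(gram, ()):
            hits[word] += 1
    return {w for w, n in hits.items() if n / len(grams) >= min_match}


def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Expensive second pass: best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(index, query, min_match=0.3):
    """Return (score, word) pairs for all candidates, highest score first."""
    scored = [(smith_waterman(query, w), w)
              for w in candidates(index, query, min_match)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    idx = build_index(["fight", "figth", "feight", "sight", "light", "banana"])
    for score, word in search(idx, "fight"):
        print(score, word)
```

Note that with these illustrative weights "sight" (8) still outscores "figth" (7), so the second-pass metric or its weights would need tuning for the misspelling case discussed in this thread.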
Thanks,
Jagdish

On Sat, Jun 8, 2013 at 11:03 PM, Shashi Kant <sk...@sloan.mit.edu> wrote:

> n-grams might help, followed by a edit distance metric such as Jaro-Winkler
> or Smith-Waterman-Gotoh to further filter out.
>
> On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
>
> > Interesting problem. The first thing that comes to mind is to do
> > "word expansion" during indexing. Kind of like synonym expansion, but
> > maybe a bit more dynamic. If you can have a dictionary of correctly
> > spelled words, then for each token emitted by the tokenizer you could
> > look up the dictionary and expand the token to all other words that
> > are similar/close enough. This would not be super fast, and you'd
> > likely have to add some custom heuristic for figuring out what
> > "similar/close enough" means, but it might work.
> >
> > I'd love to hear other ideas...
> >
> > Otis
> > --
> > Solr & ElasticSearch Support
> > http://sematext.com/
> >
> > On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
> > <kamesh...@gmail.com> wrote:
> > > Hi,
> > >
> > > I have a problem where our text corpus on which we need to do search
> > > contains many misspelled words. Same word could also be misspelled in
> > > several different ways. It could also have documents that have correct
> > > spellings However, the search term that we give in query would always be
> > > correct spelling. Now when we search on a term, we would like to get all
> > > the documents that contain both correct and misspelled forms of the search
> > > term.
> > > We tried fuzzy search, but it doesn't work as per our expectations. It
> > > returns any close match, not specifically misspelled words. For example, if
> > > I'm searching for a word like "fight", I would like to return the documents
> > > that have words like "figth" and "feight", not documents with words like
> > > "sight" and "light".
> > > Is there any suggested approach for doing this?
> > >
> > > regards,
> > > Kamesh

--
Jagdish Nomula
Sr. Manager Search
Simply Hired, Inc.
370 San Aleso Ave., Ste 200
Sunnyvale, CA 94085
office - 408.400.4700
cell - 408.431.2916
email - jagd...@simplyhired.com <yourem...@simplyhired.com>
www.simplyhired.com