Interesting problem. The first thing that comes to mind is to do "word expansion" at index time, similar to synonym expansion but a bit more dynamic. If you have a dictionary of correctly spelled words, then for each token emitted by the tokenizer you could consult the dictionary and expand the token to all dictionary words that are similar/close enough. This would not be super fast, and you'd likely have to add a custom heuristic for deciding what "similar/close enough" means, but it might work.
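To make the idea concrete, here is a rough sketch in plain Python rather than an actual Solr analysis chain. Everything in it is an assumption for illustration: the function names (`osa_distance`, `expand_token`, `build_index`) are hypothetical, the "close enough" heuristic is an edit distance of at most 1 that counts a swap of two adjacent letters as a single edit, and tokens already in the dictionary are left alone so that "sight" is never expanded to "fight".

```python
import re
from collections import defaultdict


def osa_distance(a: str, b: str) -> int:
    """Edit distance where insert, delete, substitute and a swap of two
    adjacent characters each cost 1 (optimal string alignment)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition, e.g. "th" <-> "ht".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def expand_token(token, dictionary, max_distance=1):
    """Return the token plus any close dictionary words.

    Assumed heuristic: a token that is itself a dictionary word is
    treated as correctly spelled and is not expanded.
    """
    if token in dictionary:
        return [token]
    close = sorted(w for w in dictionary if osa_distance(token, w) <= max_distance)
    return [token] + close


def build_index(docs, dictionary):
    """Toy inverted index: term -> set of doc ids, with expansion applied
    to every token at index time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            for term in expand_token(token, dictionary):
                index[term].add(doc_id)
    return index
```

With a dictionary containing "fight", "sight" and "light", a document containing "figth" or "feight" is also indexed under "fight", while documents containing "sight" or "light" are not. The linear scan over the dictionary for every unknown token is the "not super fast" part; a real setup would need something smarter than this.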
I'd love to hear other ideas...

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల <kamesh...@gmail.com> wrote:

> Hi,
>
> I have a problem where our text corpus on which we need to do search
> contains many misspelled words. Same word could also be misspelled in
> several different ways. It could also have documents that have correct
> spellings However, the search term that we give in query would always be
> correct spelling. Now when we search on a term, we would like to get all
> the documents that contain both correct and misspelled forms of the search
> term.
> We tried fuzzy search, but it doesn't work as per our expectations. It
> returns any close match, not specifically misspelled words. For example, if
> I'm searching for a word like "fight", I would like to return the documents
> that have words like "figth" and "feight", not documents with words like
> "sight" and "light".
> Is there any suggested approach for doing this?
>
> regards,
> Kamesh