I am processing a bunch of text coming out of OCR, i.e. machine-generated text that contains errors such as garbage characters attached to words, letters replaced with similar-looking characters (e.g. "I" with "1"), etc. The text is whitespace-tokenized, and I am trying to match each token against an index using a fuzzy match, so that small amounts of occasional garbage in the tokens do not prevent a match.
Right now I am preprocessing each query as follows:

    // term = token
    Query queryF = parser.Parse(term.Replace("~", "") + "~");

However, searcher.Search still throws "can't parse" exceptions for queries that contain brackets, quotes, and other garbage characters. So how should I fully preprocess a query to avoid these exceptions? It looks like I just need to remove a certain set of characters, just as the tilde is removed above. What is the complete set of such characters? Do I need to do any other preprocessing?

Thanks,
Ilya Zavorin
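
For reference, here is a rough sketch of the kind of preprocessing I have in mind. It escapes the characters instead of removing them, so the token text is left intact. The character list is taken from the special characters named in the Lucene query syntax documentation (+ - && || ! ( ) { } [ ] ^ " ~ * ? : \, plus / in newer versions). The class and method names (OcrQuery, Escape, ToFuzzyQueryText) are made up for illustration and are not part of Lucene.

```csharp
using System;
using System.Text;

public static class OcrQuery
{
    // Characters with special meaning in the classic Lucene query syntax.
    // "&&" and "||" are covered by escaping '&' and '|' individually.
    private const string SpecialChars = "+-&|!(){}[]^\"~*?:\\/";

    // Hypothetical helper: put a backslash in front of every special character.
    public static string Escape(string token)
    {
        var sb = new StringBuilder(token.Length * 2);
        foreach (char c in token)
        {
            if (SpecialChars.IndexOf(c) >= 0)
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Hypothetical helper: build the text of a fuzzy query for one OCR token.
    // Returns null when the token is empty or whitespace only, so the caller
    // can skip it instead of handing the parser a bare "~".
    public static string ToFuzzyQueryText(string token)
    {
        string trimmed = token.Trim();
        if (trimmed.Length == 0)
        {
            return null;
        }
        return Escape(trimmed) + "~";
    }
}
```

The result would then be passed to the parser in place of the Replace call, e.g. parser.Parse(OcrQuery.ToFuzzyQueryText(term)), after checking for null. As far as I can tell, the query parser also has its own static escape method (QueryParser.Escape in the .NET port), which is probably preferable to a hand-written list if the version in use has it.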