I am processing a large amount of text produced by OCR, i.e. machine-generated 
text that contains errors such as garbage characters attached to words and 
letters replaced with similar-looking characters (e.g. "I" with "1"). The 
text is whitespace-tokenized, and I am trying to match each token against an 
index using a fuzzy match, so that small amounts of occasional garbage in the 
tokens do not prevent a match.

Right now I am preprocessing each query as follows:

//term = token
Query queryF = parser.Parse(term.Replace("~", "") + "~");

However, searcher.Search still throws "can't parse" exceptions for queries that 
contain brackets, quotes and other garbage characters.

So how should I fully preprocess a query to avoid these exceptions?

It looks like I just need to remove a certain set of characters, just as the 
tilde is removed above. What is the complete set of such characters? Do I need 
to do any other preprocessing?
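
To make the idea concrete, here is a minimal sketch of the kind of 
preprocessing I have in mind. It assumes that the troublesome characters are 
the ones the classic Lucene query syntax treats as special 
(+ - & | ! ( ) { } [ ] ^ " ~ * ? : \ /), which may not be the complete set for 
every version. The class and method names are hypothetical, and it escapes 
each character with a backslash rather than deleting it, so the token itself 
is left unchanged.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Lucene: prepares a raw OCR token
// so it can be passed to the query parser as a fuzzy term.
public static class QueryTokenCleaner
{
    // Assumption: characters with special meaning in the classic
    // Lucene query syntax. Check the syntax docs for your version.
    private const string SpecialChars = "+-&|!(){}[]^\"~*?:\\/";

    // Prefixes every special character with a backslash.
    public static string EscapeToken(string token)
    {
        var sb = new StringBuilder(token.Length * 2);
        foreach (char c in token)
        {
            if (SpecialChars.IndexOf(c) >= 0)
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Returns the escaped token followed by the fuzzy operator "~",
    // or null when the token is empty or whitespace only, since
    // such a token cannot form a query term.
    public static string BuildFuzzyQueryText(string token)
    {
        if (string.IsNullOrWhiteSpace(token))
        {
            return null;
        }
        return EscapeToken(token.Trim()) + "~";
    }
}
```

The result would then go to the parser in place of the expression above, e.g. 
parser.Parse(QueryTokenCleaner.BuildFuzzyQueryText(term)) after a null check. 
Tokens that are exactly AND, OR or NOT are probably still read as operators 
and may need separate handling. If the Lucene version in use provides a static 
QueryParser.Escape method, that may be preferable to a hand-written list.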

Thanks,

Ilya Zavorin
