I have the following task that I need to implement in .NET. I get a block of text and need to assess whether it is mostly readable or a bunch of unreadable garbage. The text is produced by processes like OCR. I am not looking to detect or correct small errors. Instead, I need to "triage" the text block and return TRUE if the whole block is more or less readable (as well as searchable, etc.) or FALSE if it is mostly garbage.
My current plan is to:

1. Use Lucene.NET to index a large dictionary of English words.
2. Tokenize the text, throwing out stopwords and words shorter than some minimum number of characters.
3. Query each token against the index using some sort of fuzzy match that gives me not only the closest dictionary match for a given token but also the distance.
4. Somehow combine the individual distances into a cumulative measure for the whole block of text.
5. Compare that measure against some threshold and return FALSE if the measure is above the threshold, TRUE otherwise.

Here are some questions:

1. Is there anything special I need to do during indexing of the dictionary to make the fuzzy matching work better?
2. What sort of fuzzy matching methods are available in Lucene.NET querying? Do they return distances for the closest matches? Does the choice of a matching method affect how indexing should be done?
3. Is there a way of running the whole block of text against the index at once rather than tokenizing and looping over tokens?
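To make the plan concrete, here is a rough, untested sketch of steps 1-3 and 5 as I currently imagine them. I'm assuming Lucene.NET 4.8; FuzzyQuery is the one fuzzy method I've found so far (it takes a maximum edit distance rather than returning one, which is part of what question 2 is about). The regex tokenizer, the 4-character minimum, the 0.7 threshold, and the names BuildIndex/IsMostlyReadable are all placeholders of my own:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class ReadabilityTriage
{
    // Step 1: build an in-memory index with one document per dictionary word.
    public static Directory BuildIndex(IEnumerable<string> dictionaryWords)
    {
        var dir = new RAMDirectory();
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, new KeywordAnalyzer());
        using (var writer = new IndexWriter(dir, config))
        {
            foreach (var word in dictionaryWords)
            {
                var doc = new Document();
                // StringField indexes the word as a single un-analyzed term.
                doc.Add(new StringField("word", word.ToLowerInvariant(), Field.Store.NO));
                writer.AddDocument(doc);
            }
            writer.Commit();
        }
        return dir;
    }

    // Steps 2, 3, 5: tokenize, fuzzy-query each token, then compare the
    // fraction of matched tokens against a threshold.
    public static bool IsMostlyReadable(string text, Directory dir,
                                        int minTokenLength = 4, double threshold = 0.7)
    {
        // Crude tokenizer; stopword filtering from step 2 omitted here.
        var tokens = Regex.Matches(text, "[A-Za-z]+")
                          .Cast<Match>()
                          .Select(m => m.Value.ToLowerInvariant())
                          .Where(t => t.Length >= minTokenLength)
                          .ToList();
        if (tokens.Count == 0)
            return false; // nothing usable to judge

        using (var reader = DirectoryReader.Open(dir))
        {
            var searcher = new IndexSearcher(reader);
            int matched = 0;
            foreach (var token in tokens)
            {
                // FuzzyQuery with a maximum edit distance of 2 (its ceiling in 4.x).
                var query = new FuzzyQuery(new Term("word", token), 2);
                if (searcher.Search(query, 1).TotalHits > 0)
                    matched++;
            }
            return (double)matched / tokens.Count >= threshold;
        }
    }
}
```

For step 4 I would eventually want actual per-token distances instead of this hit-or-miss count, which is exactly what my question 2 is getting at.

Thanks much,
Ilya Zavorin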