I have the following task that I need to implement in .NET. I get a block of text and need to assess whether it is mostly readable or a bunch of unreadable garbage. The text is produced by processes like OCR. I am not looking to detect or correct small errors. Instead, I need to "triage" the text block and return TRUE if the whole block is more or less readable (as well as searchable, etc.) or FALSE if it is mostly garbage.
My current plan is to:

1. Use Lucene.NET to index a large dictionary of English words.
2. Tokenize the text, throwing out stopwords and words shorter than some minimum number of characters.
3. Query each token against the index using some sort of fuzzy match that gives me not only the closest dictionary match for a given token but also the distance.
4. Somehow combine the individual distances into a cumulative measure for the whole block of text.
5. Compare that measure against some threshold and return FALSE if the measure is above the threshold, TRUE otherwise.

Here are some questions:

1. Is there anything special I need to do during indexing of the dictionary to make the fuzzy matching work better?
2. What sort of fuzzy matching methods are available in Lucene.NET querying? Do they return distances for the closest matches? Does the choice of a matching method affect how indexing should be done?
3. Is there a way of running the whole block of text against the index at once rather than tokenizing and looping over tokens?
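To make the plan concrete, here is a rough, untested sketch of steps 1-3 and 5 as I currently imagine them. I'm assuming Lucene.NET 4.8; FuzzyQuery is the one fuzzy method I've found so far (it takes a maximum edit distance rather than returning one, which is part of what question 2 is about). The regex tokenizer, the 4-character minimum, the 0.7 threshold, and the names BuildIndex/IsMostlyReadable are all placeholders of my own:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class ReadabilityTriage
{
    // Step 1: build an in-memory index with one document per dictionary word.
    public static Directory BuildIndex(IEnumerable<string> dictionaryWords)
    {
        var dir = new RAMDirectory();
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, new KeywordAnalyzer());
        using (var writer = new IndexWriter(dir, config))
        {
            foreach (var word in dictionaryWords)
            {
                var doc = new Document();
                // StringField indexes the word as a single un-analyzed term.
                doc.Add(new StringField("word", word.ToLowerInvariant(), Field.Store.NO));
                writer.AddDocument(doc);
            }
            writer.Commit();
        }
        return dir;
    }

    // Steps 2, 3, 5: tokenize, fuzzy-query each token, then compare the
    // fraction of matched tokens against a threshold.
    public static bool IsMostlyReadable(string text, Directory dir,
                                        int minTokenLength = 4, double threshold = 0.7)
    {
        // Crude tokenizer; stopword filtering from step 2 omitted here.
        var tokens = Regex.Matches(text, "[A-Za-z]+")
                          .Cast<Match>()
                          .Select(m => m.Value.ToLowerInvariant())
                          .Where(t => t.Length >= minTokenLength)
                          .ToList();
        if (tokens.Count == 0)
            return false; // nothing usable to judge

        using (var reader = DirectoryReader.Open(dir))
        {
            var searcher = new IndexSearcher(reader);
            int matched = 0;
            foreach (var token in tokens)
            {
                // FuzzyQuery with a maximum edit distance of 2 (its ceiling in 4.x).
                var query = new FuzzyQuery(new Term("word", token), 2);
                if (searcher.Search(query, 1).TotalHits > 0)
                    matched++;
            }
            return (double)matched / tokens.Count >= threshold;
        }
    }
}
```

For step 4 I would eventually want actual per-token distances instead of this hit-or-miss count, which is exactly what my question 2 is getting at.

Thanks much,
Ilya Zavorin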