Hi Manuel,

I think OCR error correction is a well-known NLP task, and I've thought for some time that it could be implemented with Lucene.
Here is the idea in brief:

1. You have an existing Lucene index built from correct (i.e. error-free) documents in the same domain as the OCR'ed documents.

2. Tokenize the OCR'ed text into shingles (word pairs). From them you'll get phrases such as:

   the quiok
   tlne quick
   the quick

3. Search those phrases against the existing index (see the sketch below). I think an exact phrase search (PhraseQuery) or a FuzzyQuery could work. You should get the highest hit count when searching "the quick" among those phrases.
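Roughly, the flow could look like the sketch below against the Lucene 4.x API. It is only a sketch, not tested code: the class name OcrShingleChecker, the field name "body", and the index path are placeholders, and I'm using ShingleFilter rather than a dedicated shingle tokenizer, so adjust it to your own schema and Lucene version.

import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class OcrShingleChecker {

    // Step 2: break OCR'ed text into two-word shingles.
    static List<String> shingles(Analyzer analyzer, String ocrText) throws Exception {
        List<String> result = new ArrayList<String>();
        ShingleFilter filter = new ShingleFilter(
                analyzer.tokenStream("body", new StringReader(ocrText)), 2, 2);
        filter.setOutputUnigrams(false);            // keep only the 2-word shingles
        CharTermAttribute term = filter.addAttribute(CharTermAttribute.class);
        filter.reset();
        while (filter.incrementToken()) {
            result.add(term.toString());
        }
        filter.end();
        filter.close();
        return result;
    }

    // Step 3: count exact-phrase hits for one shingle in the clean index.
    static int hitCount(IndexSearcher searcher, String shingle) throws Exception {
        PhraseQuery query = new PhraseQuery();
        for (String word : shingle.split(" ")) {
            query.add(new Term("body", word));
        }
        TotalHitCountCollector collector = new TotalHitCountCollector();
        searcher.search(query, collector);
        return collector.getTotalHits();
    }

    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher(
                DirectoryReader.open(FSDirectory.open(new File("/path/to/clean/index"))));
        // No stopword removal, so "the" is kept as a token.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48, CharArraySet.EMPTY_SET);

        // Step 2 on a garbled fragment: [tlne quiok, quiok brown, brown fox]
        System.out.println(shingles(analyzer, "tlne quiok brown fox"));

        // Candidate readings of the same fragment; in practice they would come
        // from shingles() applied to the OCR output plus its variant spellings.
        String[] candidates = { "the quiok", "tlne quick", "the quick" };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + hitCount(searcher, candidate));
        }

        analyzer.close();
        searcher.getIndexReader().close();
    }
}

Run against a clean index, "the quick" should come back with far more hits than "the quiok" or "tlne quick", and that difference is the signal for picking the correction.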
Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/07/02 7:19), Manuel Le Normand wrote:

Hello,

Many of our indexed documents are scanned and OCR'ed. Unfortunately we have not been able to improve the OCR quality much (less than 80% word accuracy) for various reasons, which badly hurts retrieval quality. As we use an open-source OCR engine, we are thinking of expanding every scanned term into its most likely variations in order to get a higher level of confidence. Is there any analyser that supports this kind of need, or should I make up a syntax and analyser of my own, i.e. a payload-style syntax such as:

The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4

Thanks,
Manuel
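For reference, the per-position variant expansion Manuel describes can also be expressed without a payload syntax by stacking the variant tokens at the same position (position increment 0), the same trick Lucene's SynonymFilter uses. Below is a minimal, untested sketch of such a filter against the Lucene 4.x API; the class name OcrVariantFilter and the confusion map are made up for illustration.

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Injects known OCR-confusion variants of each token at the same position
 * (position increment 0), the way SynonymFilter stacks synonyms. This is the
 * "The|1 Tlne|1 quick|2 quiok|2 ..." idea expressed through positions rather
 * than payloads.
 */
public final class OcrVariantFilter extends TokenFilter {

    private final Map<String, List<String>> variants;   // e.g. "tlne" -> ["the"]
    private final Queue<String> pending = new LinkedList<String>();

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
            addAttribute(PositionIncrementAttribute.class);

    public OcrVariantFilter(TokenStream input, Map<String, List<String>> variants) {
        super(input);
        this.variants = variants;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // First drain any variants queued for the token just emitted.
        if (!pending.isEmpty()) {
            termAtt.setEmpty().append(pending.poll());
            posIncrAtt.setPositionIncrement(0);    // stack on the previous token's position
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        List<String> alts = variants.get(termAtt.toString());
        if (alts != null) {
            pending.addAll(alts);                  // emit these on the following calls
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}

Wired into the analysis chain after the tokenizer (and after lowercasing, so the map keys match), "tlne" and "the" end up indexed at the same position, and an ordinary PhraseQuery, or the shingle-and-count approach above, works unchanged. If the confusions can be written as a synonym map, SynonymFilter already provides this behaviour out of the box.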