I've been poking around the list archives and haven't come across
anything relevant. Is anyone using Lucene to index OCR text? Are there any
strategies/algorithms/packages you'd recommend?
 
I have a large collection (10^7 docs) that's mostly the result of OCR. We
index/search/etc. with Lucene without any trouble, but OCR errors are a
problem, particularly for exact phrase matches. I'm looking for
ideas on how to deal with this thorny issue.
 
--
Renaud Waldura
Applications Group Manager
Library and Center for Knowledge Management
University of California, San Francisco
(415) 502-6660

 
