Op Friday 25 January 2008 03:46:23 schreef Kyle Maxwell: > > I've been poking around the list archives and didn't really come up against > > anything interesting. Anyone using Lucene to index OCR text? Any > > strategies/algorithms/packages you recommend? > > > > I have a large collection (10^7 docs) that's mostly the result of OCR. We > > index/search/etc. with Lucene without any trouble, but OCR errors are a > > problem, when doing exact phrase matches in particular. I'm looking for > > ideas on how to deal with this thorny problem. > > How about Letter-by-letter ngrams coupled with SpanQueries (or more > likely, a custom query utilizing the TermPositions iterator)? >
There is no way to do exact phrase matching on OCR data, because no correction of OCR data will be perfect. Otherwise the OCR would have made the correction... What you'll need is something like a fuzzy query as the leafs of a phrase query. Also, there may be missing word boundaries, and in that case you'll have to use a truncation query instead of a phrase query. The more fuzzyness introduced in the query, the higher the chance of false matches, so there really is no single answer to this. It depends on how many false matches the users will accept and on how many OCR errors there are. One could start by adding some fuzzy term matching to phrase query, and see what the users think of that. They will lose some performance, and that is another factor in the fuzzyness tradeoff. SpanQueries could be used too, for these a fuzzy term match would need to be added, as well as a query parser. When adding fuzzy term matching to a phrase query looks to be a bit daunting, have a look at the surround query parser in the contrib area. It has truncation and proximity based on span queries, but no fuzzy term matching, so it could also be a start for investigating. It all depends on how good the OCR was, but in some cases (think old paper) it's just not possible to do good OCR. Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]