Re: Lucene to index OCR text

2008-01-29 Thread mark harwood
Sent: Tuesday, 29 January, 2008 8:00:56 AM
Subject: Re: Lucene to index OCR text
On Tuesday 29 January 2008 03:32:08, Daniel Noll wrote:
> On Friday 25 January 2008 19:26:44 Paul Elschot wrote:
> > There is no way to do exact phrase matching on OCR data, be

Re: Lucene to index OCR text

2008-01-29 Thread Paul Elschot
On Tuesday 29 January 2008 03:32:08, Daniel Noll wrote:
> On Friday 25 January 2008 19:26:44 Paul Elschot wrote:
> > There is no way to do exact phrase matching on OCR data, because no
> > correction of OCR data will be perfect. Otherwise the OCR would have made
> > the correction...
> >
> > The

Re: Lucene to index OCR text

2008-01-28 Thread Daniel Noll
On Friday 25 January 2008 19:26:44 Paul Elschot wrote:
> There is no way to do exact phrase matching on OCR data, because no
> correction of OCR data will be perfect. Otherwise the OCR would have made
> the correction...
> The problem I see with a fuzzy query is that if you have the fuzziness set

RE: Lucene to index OCR text

2008-01-25 Thread Renaud Waldura
Sent: Friday, January 25, 2008 7:31 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene to index OCR text
Thanks everyone for their ideas and suggestions! Some had occurred to us but were discarded because we feel our solution needs to be automated -- 45 million

Re: Lucene to index OCR text

2008-01-25 Thread waldura
Thanks everyone for their ideas and suggestions! Some had occurred to us but were discarded because we feel our solution needs to be automated -- 45 million pages are a lot of thrust on any human-driven effort. I like Itamar's idea of doing "competing" OCR, and keeping the best result. Unfortunate

Re: Lucene to index OCR text

2008-01-25 Thread Erick Erickson
enaud Waldura <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Friday, 25 January, 2008 1:43:06 AM
> Subject: Lucene to index OCR text
>
> I've been poking around the list archives and didn't really come up

Re: Lucene to index OCR text

2008-01-25 Thread mark harwood
Lucene to index OCR text I've been poking around the list archives and didn't really come up against anything interesting. Anyone using Lucene to index OCR text? Any strategies/algorithms/packages you recommend? I have a large collection (10^7 docs) tha

RE: Lucene to index OCR text

2008-01-25 Thread Itamar Syn-Hershko
re too big?
Itamar.
-----Original Message-----
From: Paul Elschot [mailto:[EMAIL PROTECTED]
Sent: Friday, January 25, 2008 10:27 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene to index OCR text
On Friday 25 January 2008 03:46:23, Kyle Maxwell wrote:
> > I've been poking around

Re: Lucene to index OCR text

2008-01-25 Thread Paul Elschot
On Friday 25 January 2008 03:46:23, Kyle Maxwell wrote:
> > I've been poking around the list archives and didn't really come up against
> > anything interesting. Anyone using Lucene to index OCR text? Any
> > strategies/algorithms/packages you recommend?
> >

Re: Lucene to index OCR text

2008-01-24 Thread Kyle Maxwell
> I've been poking around the list archives and didn't really come up against
> anything interesting. Anyone using Lucene to index OCR text? Any
> strategies/algorithms/packages you recommend?
>
> I have a large collection (10^7 docs) that's mostly the result of OC

Re: Lucene to index OCR text

2008-01-24 Thread Erick Erickson
ng you come across. Especially in the way of cleaning existing OCRd data.
Mostly, I'm expressing sympathy for the size and complexity of the task you're undertaking ..
Best
Erick
On Jan 24, 2008 8:43 PM, Renaud Waldura <[EMAIL PROTECTED]> wrote:
> I've been poking arou

Lucene to index OCR text

2008-01-24 Thread Renaud Waldura
I've been poking around the list archives and didn't really come up against anything interesting. Anyone using Lucene to index OCR text? Any strategies/algorithms/packages you recommend? I have a large collection (10^7 docs) that's mostly the result of OCR. We index/search/e