You could try taking a large corpus of text (say Wikipedia) and using it to
inform the likelihood of word sequences. Take the OCR output, produce fuzzy
spelling variations for each word in a window of text (say 5 or 6 words), and
then examine the likelihood of the different permutations using the corpus.
That's a lot of combinations, edit distance calculations and SpanQueries, so
performance will suffer, but accuracy is likely to be better than anything
based on single-word analysis. As mentioned before, if a "confidence level"
were available from the OCR software then that would avoid a lot of
unnecessary lookups, or the potential replacement of correctly OCRed words
with alternative words deemed to be statistically more likely.
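To make that a bit more concrete, here is a rough sketch of the window
rescoring idea in plain Java. It is only a sketch under stated assumptions:
the per-word variation lists and the bigram counts (which would in practice
come from an index built over something like Wikipedia) are placeholders, not
a real implementation.

    import java.util.*;

    // Sketch only: rescore a small window of OCR output against corpus
    // bigram counts. "variations" and "corpusCounts" are assumed inputs.
    public class WindowRescorer {

        // word -> plausible spelling variations (including the word itself)
        private final Map<String, List<String>> variations;
        // "word1 word2" -> frequency of that bigram in the corpus
        private final Map<String, Long> corpusCounts;

        public WindowRescorer(Map<String, List<String>> variations,
                              Map<String, Long> corpusCounts) {
            this.variations = variations;
            this.corpusCounts = corpusCounts;
        }

        /** Return the permutation of the window with the highest corpus score. */
        public List<String> best(List<String> window) {
            List<List<String>> candidates = new ArrayList<List<String>>();
            expand(window, 0, new ArrayList<String>(), candidates);
            List<String> best = window;
            long bestScore = score(window);
            for (List<String> candidate : candidates) {
                long s = score(candidate);
                if (s > bestScore) {
                    bestScore = s;
                    best = candidate;
                }
            }
            return best;
        }

        // Enumerate every combination of per-word variations. This is
        // exponential in the window size, which is why even a 5-6 word
        // window is already expensive.
        private void expand(List<String> window, int i, List<String> prefix,
                            List<List<String>> out) {
            if (i == window.size()) {
                out.add(new ArrayList<String>(prefix));
                return;
            }
            List<String> alts = variations.get(window.get(i));
            if (alts == null) alts = Collections.singletonList(window.get(i));
            for (String alt : alts) {
                prefix.add(alt);
                expand(window, i + 1, prefix, out);
                prefix.remove(prefix.size() - 1);
            }
        }

        // Score a candidate window by summing corpus frequencies of its bigrams.
        private long score(List<String> words) {
            long total = 0;
            for (int i = 0; i + 1 < words.size(); i++) {
                Long c = corpusCounts.get(words.get(i) + " " + words.get(i + 1));
                if (c != null) total += c;
            }
            return total;
        }
    }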

Cheers
Mark





----- Original Message ----
From: Paul Elschot <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 29 January, 2008 8:00:56 AM
Subject: Re: Lucene to index OCR text

On Tuesday 29 January 2008 03:32:08, Daniel Noll wrote:
> On Friday 25 January 2008 19:26:44, Paul Elschot wrote:
> > There is no way to do exact phrase matching on OCR data, because no
> > correction of OCR data will be perfect. Otherwise the OCR would have made
> > the correction...
> 
> <snip suggestion to use fuzzy query>
> 
> The problem I see with a fuzzy query is that if you have the fuzziness set
> to 1, then "fat" will match "mat".  But in reality, "f" and "m" don't get
> confused with OCR.
> 
> What you really want is for a given term to expand to a boolean query of
> all possible misidentified alternatives.  For that you would first need to
> figure out which characters are often misidentified as others, which can
> probably be achieved by going over a certain number of documents and
> manually checking which letters are wrong.
> 
> This should provide slightly more comprehensive matching without matching
> terms which are obviously different to the naked eye.
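To illustrate the quoted suggestion of expanding a term into a boolean query
of OCR-confusable alternatives, here is a minimal sketch against the Lucene
2.x-style API. The confusion map is only an illustrative guess; it would need
to be derived from real OCR errors as described above.

    import java.util.*;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Sketch only: expand a term into a BooleanQuery of OCR-confusable
    // spellings, one single-character substitution at a time.
    public class OcrTermExpander {

        // Single-character confusions observed in OCR output (illustrative only).
        private static final Map<Character, char[]> CONFUSIONS =
                new HashMap<Character, char[]>();
        static {
            CONFUSIONS.put('t', new char[] { 'f', 'l' });
            CONFUSIONS.put('f', new char[] { 't' });
            CONFUSIONS.put('e', new char[] { 'c', 'o' });
            CONFUSIONS.put('1', new char[] { 'l', 'i' });
        }

        /** Build "term OR alternatives" where each alternative swaps one character. */
        public static Query expand(String field, String text) {
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term(field, text)),
                      BooleanClause.Occur.SHOULD);
            for (int i = 0; i < text.length(); i++) {
                char[] subs = CONFUSIONS.get(text.charAt(i));
                if (subs == null) continue;
                for (char sub : subs) {
                    String alt = text.substring(0, i) + sub + text.substring(i + 1);
                    query.add(new TermQuery(new Term(field, alt)),
                              BooleanClause.Occur.SHOULD);
                }
            }
            return query;
        }
    }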

It's also possible to select the fuzzy terms by their document frequency, and
reject all that have a ((quite) a bit) higher doc frequency than the given
term.

Combined with a query proximity to another similarly queried term this can
work reasonably well. For query search performance selecting only low
frequency terms is nice, as it avoids searching for high frequency terms.
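A rough sketch of that doc-frequency rejection combined with span proximity
might look like the following (Lucene 2.x-era span API; the slop value and
the factor-of-four cut-off are arbitrary illustrative choices, and the fuzzy
candidate lists are assumed to come from elsewhere):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class FuzzyProximityQueryBuilder {

        /**
         * Keep only those fuzzy candidates whose doc frequency is not much
         * higher than that of the original term, then OR them into one
         * span clause.
         */
        static SpanQuery filteredOr(IndexReader reader, String field,
                                    String original, List<String> candidates)
                throws IOException {
            int baseDf = reader.docFreq(new Term(field, original));
            List<SpanQuery> kept = new ArrayList<SpanQuery>();
            kept.add(new SpanTermQuery(new Term(field, original)));
            for (String candidate : candidates) {
                int df = reader.docFreq(new Term(field, candidate));
                // Reject candidates that are far more frequent than the given
                // term; they are likely unrelated real words, not OCR damage.
                if (df <= Math.max(1, baseDf) * 4) {
                    kept.add(new SpanTermQuery(new Term(field, candidate)));
                }
            }
            return new SpanOrQuery(kept.toArray(new SpanQuery[kept.size()]));
        }

        /** Require the two expanded terms to occur near each other. */
        static SpanQuery near(IndexReader reader, String field,
                              String term1, List<String> cands1,
                              String term2, List<String> cands2)
                throws IOException {
            SpanQuery a = filteredOr(reader, field, term1, cands1);
            SpanQuery b = filteredOr(reader, field, term2, cands2);
            return new SpanNearQuery(new SpanQuery[] { a, b }, 3, false);
        }
    }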

Btw, this use of a worse spelling is more or less the opposite of suggesting
a better spelling from terms with a higher doc frequency.

> 
> What would be ideal is if an analyser could do this job (a "looks like"
> analyser, like how SoundEx is a "sounds like" analyser.)  But I get the
> feeling that this would be very difficult.  Shame the OCR software can't
> store this information, e.g. "80% odds that this character is a t but 20%
> odds that it's an f."  If you had that for every character it would be
> very useful...

Ah yes, the ideal world. Is there OCR software that provides such details?

Regards,
Paul Elschot


