Re: Training for Swedish, Danish, Norwegian, old spelling, fraktur

Lars Aronsson Wed, 28 Apr 2010 06:16:47 -0700

Sven, the more you describe the situation,
the more I realize that my needs are not the
same as yours or others who are here.
Has anybody, now that you are discussing
a fork, tried to draw a map of what kinds of
needs the user community has?


I'm scanning old books and newspapers, and
want to make really good OCR that can be
manually proofread with as little effort as
possible. This means lots of old typefaces,
lots of old spelling, lots of strange names,
often different languages on the same page,
often bad print quality, often complex page
layout. When I discover an error, I want to
fix the OCR engine, continuously training it
to become more and more perfect. If I find
a new kind of upper-case "H", it would be
insane to apply this new experience only to
the interpretation of Swedish, since it will
soon appear in texts in other languages.
It would also be insane if I were the only one
to benefit from such an improvement. It
should go back into the engine, so all users
can benefit.

Language training, as it is described for
Tesseract, clearly can't meet these needs.
The software was never designed with
these goals in mind, or it would look very
different. Just one example: If I want to
train "fraktur" (black letter), there's no
easy way I can generate a pattern page because
I don't have fraktur fonts installed on my
computer. I never write fraktur, I only read
it in old books.

The internal needs of Google Book Search
should be very similar to my needs, and if
that's where the previous lead developer
works, I can understand if he has abandoned
Tesseract for some other design. I can also
understand if Google wants to keep that new
design to themselves. It would most probably
be based on statistics from the many millions
of books that Google has already scanned.

Does anybody know of an open source OCR
project that is based on statistics from
scanned books? Could parts of the Tesseract
software library be used to cut out letters
from scanned pages, so some other software
could group them statistically?
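
A minimal sketch of what "grouping statistically" could look like,
assuming the letters have already been cut out and scaled to small
bitmaps of equal size. The cutting step is not shown and no Tesseract
call is used. The names hamming and group_glyphs, the '#'/'.' bitmap
format, and the max_diff threshold are all hypothetical choices made
for illustration.

```python
from typing import List, Tuple

# Assumption: a glyph is a tuple of rows of '#' (ink) and '.' (paper),
# and all glyphs have the same width and height.
Glyph = Tuple[str, ...]


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def group_glyphs(glyphs: List[Glyph], max_diff: int) -> List[List[int]]:
    """Greedy grouping: each glyph joins the first group whose
    representative (its first member) differs in at most max_diff
    pixels; otherwise it starts a new group. Returns lists of indices."""
    groups: List[List[int]] = []
    for i, glyph in enumerate(glyphs):
        for members in groups:
            if hamming(glyphs[members[0]], glyph) <= max_diff:
                members.append(i)
                break
        else:
            groups.append([i])
    return groups


samples = [
    ("#.#", "###", "#.#"),  # an "H"
    ("#.#", "###", "#.."),  # the same "H" with one pixel lost in print
    ("###", ".#.", ".#."),  # a "T"
]
groups = group_glyphs(samples, max_diff=2)
print(groups)
```

A human would then only need to label each group once, and a newly
found kind of "H" would form its own group regardless of the language
of the page it came from.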


--
 Lars Aronsson ([email protected])
 Aronsson Datateknik - http://aronsson.se

