Re: Training for Swedish, Danish, Norwegian, old spelling, fraktur

Lars Aronsson Sat, 24 Apr 2010 05:37:14 -0700

Sven Pedersen wrote:

> Hi Lars,
> The current development version of Tesseract 3.0 does have some
> support for Swedish and Norwegian:
> http://tesseract-ocr.googlecode.com/svn/trunk/tessdata/


I downloaded (from SVN) Tesseract 3.0, compiled it, and
ran it with "-l swe" (Swedish language) on some pages.
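For anyone who wants to repeat this, here is a minimal sketch of that invocation wrapped in Python. It assumes the compiled `tesseract` binary is on the PATH and that swe.traineddata is installed in tessdata. The helper names and the file names in the usage note are hypothetical, only for illustration.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="swe"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="swe"):
    """Run Tesseract and return the recognized text.

    Tesseract writes its result to <output_base>.txt, which is read back here.
    Assumes the binary and the language data are installed.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

With a scanned page saved as 0046.tif (a hypothetical file name), `run_ocr("0046.tif", "0046")` would leave the text in 0046.txt and return it.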

On this page, http://runeberg.org/strindbg/diktvers/0046.html
it rendered some right-pointing angle quotation marks (»)
as ">>" and failed to recognize the uppercase "H" in this font.
Otherwise it was quite successful, but that is a very easy page.
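Until the recognizer handles this itself, the ">>" substitution could be patched up after OCR. The following is only a rough sketch of such a post-processing step, not anything Tesseract provides; the function name is made up, and it assumes that a literal ">>" (optionally with one space in between) never occurs legitimately in the text.

```python
import re

# Matches two ASCII greater-than signs, optionally separated by one space.
_DOUBLE_GT = re.compile(r">\s?>")


def restore_guillemets(text):
    """Replace OCR output '>>' with the right-pointing guillemet it stands for."""
    return _DOUBLE_GT.sub("»", text)
```

For literary text like the Strindberg page this is probably safe, but it would damage anything that contains real ">>" sequences, such as source code listings.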

I also found some instructions for how to train new
languages in Tesseract 2.0x. I don't know if these
instructions are still valid for 3.0, but it seems very
strange that I should start by generating a TIFF with
"a b c d e f" in order to train the Swedish language, since
these letters are already used in English and German.
The uppercase "H" in this particular font does need
to be trained, but that is not specific to Swedish;
it is common to all languages that use the Latin script.
Swedish uses all German letters plus a-ring (å), only
with somewhat different probability weights. (Some
would say u-umlaut is not used in Swedish, but it
does appear in some personal names and OCR is much
worse without it.)

Maybe the instructions I found are useful for training
entirely new scripts (Cyrillic, Hebrew, Hindi, ...) and
not for new languages that use an already supported
script. This should be clarified in the text.

It would seem rational to make an OCR program for
Latin script recognize all letters and accents
in the most common 8-bit codes (ISO 8859-1, -2, -3, ...)
and only vary the probability weights and dictionaries
to add new languages.
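To illustrate what I mean by "same letters, different weights", here is a small sketch that estimates per-letter weights from a sample text. It is only an illustration of the idea and says nothing about how Tesseract actually stores or uses such statistics; the function name is hypothetical.

```python
from collections import Counter


def letter_weights(text):
    """Return the relative frequency of each letter in text, case-folded.

    Non-letters (digits, punctuation, whitespace) are ignored.
    An empty or letter-free text gives an empty dict.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: count / total for ch, count in counts.items()}
```

Run over a Swedish corpus and a German corpus, this would give two tables over largely the same set of letters, with å present only in the Swedish one and ü much rarer there than in the German one.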

In Tesseract 3.0, the language data is in a single
file, e.g. tessdata/swe.traineddata
Is this file format documented? Could I edit it
manually and get something useful?

> The community here is currently planning a fork of the code to
> continue development, since Google has not shown any activity in
> several months and nobody else has write access to the source code.


Oh, really. Is anybody taking the lead,
and do you have any funding for this?


--
 Lars Aronsson ([email protected])
 Aronsson Datateknik - http://aronsson.se

 Project Runeberg - free Nordic literature - http://runeberg.org/

