Sven Pedersen wrote:
Hi Lars, The current development version of Tesseract 3.0 does have some support for Swedish and Norwegian: http://tesseract-ocr.googlecode.com/svn/trunk/tessdata/
I downloaded Tesseract 3.0 from SVN, compiled it, and ran it with "-l swe" (Swedish language) on some pages. On this page, http://runeberg.org/strindbg/diktvers/0046.html it interpreted some » (right-angle quotation marks) as ">>" and failed to recognize the "H" in this font. Other than this it was quite successful. But that's a very easy page.

I also found some instructions for how to train new languages in Tesseract 2.0x. I don't know if these instructions are still valid for 3.0, but it seems very strange that I should start by generating a TIFF with "a b c d e f" in order to train the Swedish language, since these letters are already used in English and German. The uppercase "H" in this particular font does need to be trained, but that is not specific to Swedish; it is common to all languages that use the Latin script.

Swedish uses all German letters plus a-ring (å), only with somewhat different probability weights. (Some would say u-umlaut is not used in Swedish, but it does appear in some personal names and OCR is much worse without it.)

Maybe the instructions I found are useful for training entirely new scripts (Cyrillic, Hebrew, Hindi, ...) and not for new languages that use an already supported script. This should be clarified in the text.

It would seem rational to make an OCR program for Latin script recognize all letters and accents in the most common 8-bit codes (ISO 8859-1, -2, -3, ...) and only vary the probability weights and dictionaries to add new languages.

In Tesseract 3.0, the language data is in a single file, e.g. tessdata/swe.traineddata. Is this file format documented? Could I edit it manually and get something useful?
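To make the ISO 8859 idea above a bit more concrete, here is a rough Python sketch of how one could enumerate the letters that a shared Latin-script character set would have to cover. It has nothing to do with Tesseract's actual code or data files; the helper name letters_in_codepage is made up for illustration, and it relies only on Python's standard codecs.

```python
# Hypothetical helper, not part of Tesseract: list the non-ASCII letters
# that a single-byte ISO 8859 code page can encode.
def letters_in_codepage(codec):
    letters = set()
    for byte in range(0x80, 0x100):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the code page
        if ch.isalpha():
            letters.add(ch)
    return letters

latin1 = letters_in_codepage("iso8859_1")
latin2 = letters_in_codepage("iso8859_2")
latin3 = letters_in_codepage("iso8859_3")

# A shared Latin-script recognizer would be trained on the union,
# and each language would only add its own weights and dictionary.
shared = latin1 | latin2 | latin3

swedish = set("åäöÅÄÖüÜ")
print("Swedish covered by ISO 8859-1:", swedish <= latin1)
print("Letters only in ISO 8859-2/-3:", "".join(sorted(shared - latin1)))
```

Under this assumption, adding Swedish would not require any new glyph training beyond what ISO 8859-1 already needs, only a Swedish dictionary and frequency data.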
The community here is currently planning a fork of the code to continue development, since Google has not shown any activity in several months and nobody else has write access to the source code.
Oh, really. Is anybody taking the lead, and do you have any funding for this?

--
Lars Aronsson ([email protected])
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/

