Hi Lars,

Yes, I agree that putting the language and script in a single unit is not the best idea. But Tesseract does support other scripts such as Chinese and Indic; some success has been reported with Indic. Actually, I think the way the algorithm works makes the combination of dictionary and orthography more effective.
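
To illustrate the kind of effect I mean, here is a toy sketch in Python. It is not Tesseract's actual code: the shape scores, the word list, the bonus value, and the function name best_word are all made up for the example. It only shows how a word list for a language could tip the balance when the shape classifier is unsure between "a" and "å".

```python
from itertools import product

def best_word(candidates, dictionary, bonus=0.5):
    """Pick the best reading of a word.

    candidates: one list per character position, each holding
                (character, shape_score) pairs (hypothetical scores).
    dictionary: set of known lowercase words for the language.
    bonus:      assumed extra score for readings found in the dictionary.
    """
    best, best_score = None, float("-inf")
    # Try every combination of per-position character candidates.
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        score = sum(s for _, s in combo)
        if word.lower() in dictionary:
            score += bonus
        if score > best_score:
            best, best_score = word, score
    return best

# Made-up scores: the classifier slightly prefers plain "a" over "å".
candidates = [
    [("p", 0.9)],
    [("a", 0.55), ("å", 0.45)],
]

shape_only = best_word(candidates, set())          # no dictionary at all
with_swedish = best_word(candidates, {"på", "och", "är"})
print(shape_only, with_swedish)
```

With no dictionary the slightly higher shape score wins; with a small Swedish word list the reading "på" gets the bonus and is chosen instead.
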
For training details, it would be good to look through the archives of this mailing list, since people have recently asked these questions:
http://groups.google.com/group/tesseract-ocr?pli=1

In particular, you'll find that you can just train starting from the existing Swedish data with some of the texts and fonts you'll be working with. People have made programs to help generate the right kind of training data.

The fork of the project is not funded per se, but the developer who is taking the lead has funding for his part of the work, and some of us hope to get organizations we're involved with to participate.

--Sven

On Sat, Apr 24, 2010 at 6:55 AM, Lars Aronsson <[email protected]> wrote:
> Sven Pedersen wrote:
>>
>> Hi Lars,
>> The current development version of Tesseract 3.0 does have some
>> support for Swedish and Norwegian:
>> http://tesseract-ocr.googlecode.com/svn/trunk/tessdata/
>
> I downloaded (from SVN) Tesseract 3.0, compiled it, and
> ran it with "-l swe" (Swedish language) on some pages.
>
> On this page, http://runeberg.org/strindbg/diktvers/0046.html
> it interpreted some » right-angle-quotation-marks as ">>"
> and failed to recognize the "H" in this font. Other than this
> it was quite successful. But that's a very easy page.
>
> I also found some instructions for how to train new
> languages in Tesseract 2.0x. I don't know if these
> instructions are still valid for 3.0, but it seems very
> strange that I should start to generate a TIFF with
> "a b c d e f" in order to train Swedish language, since
> these letters are already used in English and German.
> The uppercase "H" in this particular font does need
> to be trained, but that is not specific to Swedish,
> but common to all languages that use the Latin script.
> Swedish uses all German letters plus a-ring (å), only
> with somewhat different probability weights. (Some
> would say u-umlaut is not used in Swedish, but it
> does appear in some personal names and OCR is much
> worse without it.)
>
> Maybe the instructions I found are useful for training
> entirely new scripts (Cyrillic, Hebrew, Hindi, ...) and
> not for new languages that use an already supported
> script. This should be clarified in the text.
>
> It would seem rational to make an OCR program for
> Latin script recognize all letters and accents
> in the most common 8-bit codes (ISO 8859-1, -2, -3, ...)
> and only vary the probability weights and dictionaries
> to add new languages.
>
> In Tesseract 3.0, the language data is in a single
> file, e.g. tessdata/swe.traineddata
> Is this file format documented? Could I edit it
> manually and get something useful?
>
>> The community here is currently planning
>> a fork of the code to continue development, since Google has not shown
>> any activity in several months and nobody else has write access to the
>> source code.
>
> Oh, really. Is anybody taking the lead,
> and do you have any funding for this?
>
>
> --
> Lars Aronsson ([email protected])
> Aronsson Datateknik - http://aronsson.se
>
> Project Runeberg - free Nordic literature - http://runeberg.org/
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>

--
"All that is gold does not glitter,
not all those who wander are lost;
the old that is strong does not wither,
deep roots are not reached by the frost.
From the ashes a fire shall be woken,
a light from the shadows shall spring;
renewed shall be blade that was broken,
the crownless again shall be king."

