I'm the founder of Project Runeberg (http://runeberg.org/), the Scandinavian volunteer book scanning project, where we have mainly been using ABBYY FineReader, followed by manual online proofreading. I'm also involved in Wikisource, the book scanning and proofreading project of the Wikimedia Foundation.

Is anybody training Tesseract to read Swedish and other Scandinavian languages? Is there a tutorial on how to train new languages in Tesseract?

I'm running Ubuntu Linux 9.10. The included package for Tesseract 2.03 contains man pages that are next to useless. There seem to be some programs (mftraining, cntraining, unicharset_extractor), but they talk about "box files" and I have no clue what these are.

In Project Runeberg, we already have 186,000 pages that are fully proofread, mostly in Swedish and Danish, in various fonts and from different years, meaning different spelling standards. Could these be used for training Tesseract? How do I start?

--
Lars Aronsson ([email protected])
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

