Training for Swedish, Danish, Norwegian, old spelling, fraktur

Lars Aronsson Fri, 23 Apr 2010 10:17:04 -0700

I'm the founder of Project Runeberg, the Scandinavian
volunteer book scanning project, http://runeberg.org/
where we have mainly been using ABBYY FineReader,
with subsequent manual, online proofreading.
I'm also involved in Wikisource, the book scanning
and proofreading project of the Wikimedia Foundation.


Is anybody training Tesseract to read Swedish and
other Scandinavian languages? Is there a tutorial
on how to train new languages in Tesseract?

I'm running Ubuntu Linux 9.10. The included package
for Tesseract 2.03 contains man pages that are next
to useless. There seem to be some programs: mftraining,
cntraining, unicharset_extractor, but they talk about
"box files" and I have no clue what these are.

In Project Runeberg, we already have 186,000 pages
that are fully proofread, mostly in Swedish and
Danish, in various fonts and from different years,
meaning different spelling standards. Could these
be used for training Tesseract? How do I start?


--
 Lars Aronsson ([email protected])
 Aronsson Datateknik - http://aronsson.se

 Project Runeberg - free Nordic literature - http://runeberg.org/

