Hi Lars,
The current development version of Tesseract 3.0 does have some
support for Swedish and Norwegian:
http://tesseract-ocr.googlecode.com/svn/trunk/tessdata/

You may want to update the link on your website, since the current
project site is:
http://code.google.com/p/tesseract-ocr/

and not the one on SourceForge. You may need to do further training
for older texts, and I would guess you could produce a Danish version
from the Norwegian one if needed. The community here is currently planning
a fork of the code to continue development, since Google has not shown
any activity in several months and nobody else has write access to the
source code.
Good luck with your project!
--Sven


On Fri, Apr 23, 2010 at 9:28 AM, Lars Aronsson <[email protected]> wrote:
> I'm the founder of Project Runeberg, the Scandinavian
> volunteer book scanning project, http://runeberg.org/
> where we have mainly been using Abbyy Finereader,
> with subsequent manual, online proofreading.
> I'm also involved in Wikisource, the book scanning
> and proofreading project of the Wikimedia Foundation.
>
> Is anybody training Tesseract to read Swedish and
> other Scandinavian languages? Is there a tutorial
> for how to train new languages in Tesseract?
>
> I'm running Ubuntu Linux 9.10. The included package
> for Tesseract 2.03 contains man pages that are next
> to useless. There seem to be some programs: mftraining,
> cntraining, unicharset_extractor, but they talk about
> "box files" and I have no clue what these are.
>
> In Project Runeberg, we already have 186,000 pages
> that are fully proofread, mostly in Swedish and
> Danish, in various fonts and from different years,
> meaning different spelling standards. Could these
> be used for training Tesseract? How do I start?
>
>
> --
>  Lars Aronsson ([email protected])
>  Aronsson Datateknik - http://aronsson.se
>
>  Project Runeberg - free Nordic literature - http://runeberg.org/
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>



-- 
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.''

