Re: Tesseract Training

Dmitry Silaev Tue, 18 Jan 2011 05:28:01 -0800

Dear Sochenda,

In addition to what Sriranga said, I'd remind you that a lot of manual work
is required:


In pyTesseractTrainer, check that no bounding box (BB) intersects a glyph; if
one does, correct its BB coordinates manually.

Where BBs overlap, you should space out the participating glyphs in the
training image (see the attached picture for examples).
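Overlapping BBs can also be flagged automatically before going through them
by hand. The following is only a minimal sketch: it assumes each box-file line
has the form "<glyph> <left> <bottom> <right> <top>", optionally followed by a
page number, and the function names are made up for illustration (they are not
part of pyTesseractTrainer or Tesseract).

```python
from itertools import combinations


def parse_box_line(line):
    """Parse one box-file line: '<glyph> <left> <bottom> <right> <top> [page]'.

    The field layout is an assumption; check it against your own box file.
    """
    parts = line.split()
    glyph = parts[0]
    left, bottom, right, top = (int(v) for v in parts[1:5])
    return glyph, left, bottom, right, top


def boxes_overlap(a, b):
    """Return True if two boxes share any interior area."""
    _, a_left, a_bottom, a_right, a_top = a
    _, b_left, b_bottom, b_right, b_top = b
    return (a_left < b_right and b_left < a_right
            and a_bottom < b_top and b_bottom < a_top)


def find_overlaps(lines):
    """Return every pair of boxes that overlap (quadratic, fine for one page)."""
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    return [(a, b) for a, b in combinations(boxes, 2) if boxes_overlap(a, b)]


if __name__ == "__main__":
    sample = ["a 10 10 20 30", "b 18 12 28 30", "c 40 10 50 30"]
    for first, second in find_overlaps(sample):
        print("overlap:", first, second)
```

Every pair it reports is a candidate for spacing out in the training image or
for a manual BB correction. It cannot tell whether a box cuts through a glyph;
that still needs a visual check.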

Use manual spacing when the participating glyphs are dependent characters
(in your language, vowels) and the number of possible combinations is too
large to enumerate. In that case, assign every glyph its own code. Tess will
treat these glyphs as separate characters, and you should post-process the
resulting code sequence to obtain a well-formed dependent Unicode pair (or
triplet).
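A post-processing pass of this kind might look like the sketch below. It is
illustrative only: the placeholder codes "K", "E" and "A" are hypothetical, and
the Khmer code points are an assumption chosen as an example of a script in
which a vowel sign is drawn to the left of its consonant but encoded after it.
Substitute the codes and characters from your own training set.

```python
# Hypothetical placeholder codes assigned during training -> Unicode characters.
CODE_TO_UNICODE = {
    "K": "\u1780",  # KHMER LETTER KA (base consonant)
    "E": "\u17C1",  # KHMER VOWEL SIGN E (drawn left of the base, encoded after it)
    "A": "\u17B6",  # KHMER VOWEL SIGN AA (drawn right of the base)
}

# Dependent signs that appear before their base in left-to-right reading order.
PRE_BASE_SIGNS = {"\u17C1"}


def postprocess(codes):
    """Map recognized codes to Unicode and move pre-base signs after their base."""
    chars = [CODE_TO_UNICODE.get(code, code) for code in codes]
    out = []
    i = 0
    while i < len(chars):
        if chars[i] in PRE_BASE_SIGNS and i + 1 < len(chars):
            # Emit the base first, then the dependent sign.
            out.append(chars[i + 1])
            out.append(chars[i])
            i += 2
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(postprocess("EK"))  # base consonant followed by the vowel sign
```

Codes without an entry in the table pass through unchanged, so ordinary
characters in the recognized text are not affected.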

If there are only a few such combinations, you can merge these BBs into one
that encompasses all the required glyphs and assign a single code to the
entire glyph combination. Then, during post-processing, you'll need to
replace this single code with a predefined dependent Unicode pair.
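For the merged-BB case the replacement step is a plain table lookup. Again,
this is just a sketch under assumptions: the single codes here are taken from
the Unicode Private Use Area (U+E000 onwards) and the target pair is
hypothetical; use whatever codes you actually assigned in the box file.

```python
# Hypothetical single codes assigned to merged boxes -> predefined Unicode pairs.
MERGED_CODE_TO_SEQUENCE = {
    "\uE000": "\u1780\u17C1",  # one merged box -> base consonant + dependent vowel
    "\uE001": "\u1780\u17B6",
}


def expand_merged_codes(text):
    """Replace every merged-glyph code with its predefined Unicode sequence."""
    return "".join(MERGED_CODE_TO_SEQUENCE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    recognized = "x\uE000y"
    print(expand_merged_codes(recognized))
```

Because each merged code maps to a fixed sequence, no reordering logic is
needed; the trade-off is one table entry (and one set of training samples) per
combination.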

Hope I've managed to express myself clearly.

Warm regards,
Dmitry Silaev


<<attachment: figure01.GIF>>
