Dear Sochenda,

In addition to what Sriranga said, I'd remind you that a lot of manual work is still required:
In pyTesseractTrainer, check that no bounding box (BB) cuts through a glyph; if one does, correct its coordinates manually. Where BBs overlap, space out the glyphs involved in the training image (see the attached picture for examples).

Use manual spacing when the glyphs involved are dependent characters (in your language, vowels) and the number of possible combinations is too large to enumerate in practice. In that case, assign every glyph its own code. Tesseract will then treat these glyphs as separate characters, and you should post-process the resulting code sequence to obtain a well-formed dependent Unicode pair (or triplet).

If only a few such combinations are possible, you can instead merge their BBs into one that encompasses all the required glyphs and assign a single code to the entire glyph combination. During post-processing you will then need to replace this single code with the predefined dependent Unicode pair.

I hope I've managed to express myself clearly.

Warm regards,
Dmitry Silaev
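
As a rough illustration of the BB overlap check mentioned above, here is a minimal Python sketch. It assumes box-file lines of the form "<glyph> <left> <bottom> <right> <top> [<page>]" with integer pixel coordinates; the names Box, parse_box_line, boxes_overlap and find_overlaps are made up for this example and are not part of pyTesseractTrainer or Tesseract. It only detects box-to-box overlap; whether a box cuts through a glyph's pixels still has to be checked against the image.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """One bounding box from a box file (hypothetical helper type)."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line: str) -> Box:
    # Assumed layout: "<glyph> <left> <bottom> <right> <top> [<page>]".
    parts = line.split()
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return Box(parts[0], left, bottom, right, top)


def boxes_overlap(a: Box, b: Box) -> bool:
    # Axis-aligned rectangles overlap when their extents intersect on
    # both axes. Boxes that merely share an edge are not reported.
    return (a.left < b.right and b.left < a.right
            and a.bottom < b.top and b.bottom < a.top)


def find_overlaps(boxes: List[Box]) -> List[Tuple[Box, Box]]:
    # Compare every pair once; fine for the few hundred boxes on a page.
    return [(a, b)
            for i, a in enumerate(boxes)
            for b in boxes[i + 1:]
            if boxes_overlap(a, b)]


if __name__ == "__main__":
    sample = ["a 10 10 30 40 0", "b 25 12 45 38 0", "c 50 10 70 40 0"]
    for first, second in find_overlaps([parse_box_line(s) for s in sample]):
        print("overlap:", first.glyph, second.glyph)
```

Each reported pair is a candidate for either spacing the glyphs apart in the training image or merging the two boxes into one.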
<<attachment: figure01.GIF>>
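
For the merged-BB case, the post-processing step could look roughly like the Python sketch below. It assumes each glyph combination was trained under a single placeholder code taken from the Unicode Private Use Area; the table contents and the names PLACEHOLDER_TO_SEQUENCE and expand_placeholders are hypothetical, and the Devanagari pair is only an arbitrary example of a base letter followed by a dependent vowel sign, not something taken from this thread. The separately coded case would additionally need script-specific reordering, which this sketch does not attempt.

```python
from typing import Dict

# Hypothetical table: placeholder code used in training -> the
# well-formed Unicode sequence it stands for. Replace the entries with
# the combinations that actually occur in your script.
PLACEHOLDER_TO_SEQUENCE: Dict[str, str] = {
    "\uE000": "\u0915\u093F",  # example only: base letter + dependent vowel sign
}


def expand_placeholders(text: str, table: Dict[str, str]) -> str:
    # Replace every placeholder character with its predefined sequence;
    # characters that are not in the table pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    recognized = "\uE000"  # pretend this came back from the OCR engine
    fixed = expand_placeholders(recognized, PLACEHOLDER_TO_SEQUENCE)
    print([hex(ord(ch)) for ch in fixed])
```

A plain lookup table is enough here precisely because the number of combinations is small; it stops being practical once the combinations can no longer be listed by hand.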