Sochenda, Attached is a text file of the Khmer alphabet, prepared from the character map and the Unicode chart, since I am unable to type in your language even though I have installed the font you supplied. Please prepare the text (saved as UTF-8) following the attached sample txt file. I shall then try to generate the trained data.
On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
> Dear Dmitry and Sriranga,
>
> Thank you very much for your help. Is my output file empty because I put my person ID on the glyphs?
>
> Dear Dmitry,
> Please see the attached image. Should the glyph in the red box be assigned to a single Unicode character, or separated as shown in the image? This glyph is composed of two other glyphs: one can be represented by a Unicode character, and the other is part of a vowel.
>
> Dear Sriranga,
>
> Do the first several lines in your unicharset files represent characters, or are they just arbitrary Unicode characters that represent no character at all?
>
> The Khmer font is also attached.
>
> Best Regards,
> Sochenda
>
> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <daemons2...@gmail.com> wrote:
>
>> Dear Sochenda,
>>
>> In addition to what Sriranga said, I'd remind you that you should expect a lot of manual work:
>>
>> In pyTesseractTrainer, check that no bounding box intersects a glyph; if one does, correct its BB coordinates manually.
>>
>> In cases of BB overlap, you should space out the participating glyphs in the training image (see the attached picture for examples).
>>
>> You should use manual spacing if the participating glyphs are dependent characters (in your language, vowels) and the number of possible combinations is practically uncountable. Then you would assign every glyph its own code. Tesseract would treat these glyphs as separate characters, and you should post-process the resulting code sequence to obtain a well-formed dependent Unicode pair (or triplet).
>>
>> If there can be only a few such combinations, you can merge these BBs into one that encompasses all the required glyphs and assign a single code to the entire glyph combination. Then, during post-processing, you'll need to replace this single code with a predefined dependent Unicode pair.
>>
>> I hope I've managed to express myself clearly.
>>
>> Warm regards,
>> Dmitry Silaev
>>
>> --
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com.
>> For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
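The post-processing step Dmitry describes (replacing the codes assigned during training with well-formed Unicode sequences) could be sketched roughly as follows. This is only an illustration under stated assumptions: the private-use code points U+E000 to U+E002 and the contents of the mapping table are hypothetical placeholders, not codes taken from anyone's actual training data, and `postprocess` is a made-up helper name.

```python
# Minimal sketch of post-processing OCR output, assuming placeholder codes
# were assigned to merged or split glyphs during training.
# All mapping entries below are hypothetical examples.

PLACEHOLDER_TO_UNICODE = {
    # One merged bounding box trained as a single code -> predefined pair
    # (KHMER LETTER KA + KHMER VOWEL SIGN AA).
    "\uE000": "\u1780\u17B6",
    # Two vowel parts trained as separate codes around a consonant ->
    # consonant followed by the single vowel sign U+17BE.
    "\uE001\u1780\uE002": "\u1780\u17BE",
}


def postprocess(raw: str, table: dict) -> str:
    """Replace every key of `table` found in `raw`, longest key first."""
    # Ignore empty keys so the scan always advances.
    keys = sorted((k for k in table if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(raw):
        for key in keys:
            if raw.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No placeholder starts here; copy the character unchanged.
            out.append(raw[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    recognized = "\uE000\u1781\uE001\u1780\uE002"
    fixed = postprocess(recognized, PLACEHOLDER_TO_UNICODE)
    print([f"U+{ord(c):04X}" for c in fixed])
```

Matching the longest key first keeps a multi-code sequence from being partially consumed by a shorter entry; the real table would have to be built from whatever codes were actually assigned in the box files.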
1780
ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម យ រ ល វ ឝ ឞ ស ហ ឡ អ ឣ ឤ ឥ ឦ ឧ ឨ ឩ ឪ ឫ ឬ
INDEPENDENT VOWEL
ឬ ឭ ឮ ឯ ឰ ឱ ឲ ឳ
DEPENDENT VOWEL
឵ ិ ិី ឺឹ ឺ ុ ូ ួ
TWO PART DEPENDENT VOWEL
ើ ឿ ៀ
1782 1783 1784 1785 1786 1787 1788 1789 178A 178B 178C 178D 178E 178F 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 179A 179B 179C 179D 17B6 17B7 17B8
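For anyone who cannot type Khmer directly, a list of hex code points like the one above can be turned into UTF-8 text with a few lines of Python. This is a rough sketch, not part of the original attachment: the function names and the output file name `khmer_sample.txt` are assumptions, and the range check simply uses the Khmer Unicode block (U+1780 to U+17FF).

```python
# Sketch: convert space-separated hex code points into characters and
# save them as UTF-8. Function and file names are hypothetical.
from pathlib import Path

KHMER_FIRST, KHMER_LAST = 0x1780, 0x17FF  # Khmer Unicode block


def codepoints_to_text(hex_codes: str) -> str:
    """Turn '1782 1783 ...' into the matching characters, space-separated."""
    chars = []
    for token in hex_codes.split():
        value = int(token, 16)
        if not KHMER_FIRST <= value <= KHMER_LAST:
            raise ValueError(f"U+{value:04X} is outside the Khmer block")
        chars.append(chr(value))
    return " ".join(chars)


def write_utf8(path: Path, text: str) -> None:
    """Write the text explicitly as UTF-8, regardless of platform default."""
    path.write_text(text + "\n", encoding="utf-8")


if __name__ == "__main__":
    # A few code points copied from the list above.
    text = codepoints_to_text("1782 1783 1784 17B6 17B7 17B8")
    write_utf8(Path("khmer_sample.txt"), text)
```

Passing `encoding="utf-8"` explicitly matters on systems whose default encoding is something else, which is a common reason a training text file ends up unreadable.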