Re: Tesseract Training

Sriranga(78yrsold) Wed, 19 Jan 2011 01:24:29 -0800

Sochenda,
Attached is a Khmer alphabet txt file, prepared from the character map and the
Unicode chart, since I am unable to type in your language even though I have
installed the font you supplied.
Please prepare the text (saved as UTF-8) following the attached sample txt
file. I shall try to generate the trained data.


On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:

> Dear Dmitry and Sriranga,
>
> Thank you very much for your help. Is my output file empty because I
> assigned my own IDs to the glyphs?
>
> Dear Dmitry,
> Please see the attached image: should the glyph in the red box be assigned
> to a single Unicode character, or separated as in the image? This glyph is
> composed of two other glyphs: one can be represented by a Unicode character,
> and the other is part of a vowel.
>
> Dear Sriranga,
>
> Do the first several lines in your unicharset files represent actual
> characters, or are they just arbitrary Unicode characters that stand for no
> character?
>
> Khmer font is also attached.
>
> Best Regards,
> Sochenda
>
>
>
> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <daemons2...@gmail.com> wrote:
>
>> Dear Sochenda,
>>
>> In addition to what Sriranga said, I'd remind you that you should do a lot
>> of manual work:
>>
>> In pyTesseractTrainer, check that no bounding boxes intersect glyphs; if
>> some do, correct their BB coordinates manually.
>>
>> In cases of BB overlap you should space out participating glyphs in the
>> training image (see the attached picture for examples).
>>
>> You should use manual spacing if participating glyphs are dependent
>> characters (in your language - vowels) and the number of possible
>> combinations is practically uncountable. Then you would assign every glyph
>> its own code. Tess would consider these glyphs as separate characters and
>> you should post-process the resulting code sequence to obtain a well-formed
>> dependent Unicode pair (or triplet).
>>
>> If there are only a few such combinations, you can merge these BBs into
>> one to encompass all the required glyphs and assign a single code to the
>> entire glyph combination. Then during the post-processing you'll need to
>> replace this single code with a predefined dependent Unicode pair.
>>
>> Hope I've managed to express myself clearly.
>>
>> Warm regards,
>> Dmitry Silaev
>>
>>
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> To unsubscribe from this group, send email to
>> tesseract-ocr+unsubscr...@googlegroups.com<tesseract-ocr%2bunsubscr...@googlegroups.com>
>> .
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>
>
>
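
The two post-processing routes Dmitry describes in the quoted message can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not his actual procedure: the private-use code U+E000, the `MERGED_CODES` table and both function names are hypothetical, and the reordering step assumes the recognizer emits glyphs in left-to-right visual order, so a vowel sign drawn to the left of its consonant (U+17C1 to U+17C3) arrives before the consonant instead of after it.

```python
# Route 1: every glyph has its own code; fix the order afterwards.
# Khmer vowel signs that are drawn to the LEFT of the consonant but must
# FOLLOW it in a well-formed Unicode sequence.
PRE_BASE_VOWELS = {"\u17C1", "\u17C2", "\u17C3"}


def is_consonant(ch: str) -> bool:
    """True for Khmer consonants, U+1780 through U+17A2."""
    return "\u1780" <= ch <= "\u17A2"


def reorder_prebase(text: str) -> str:
    """Move a left-side vowel sign behind the consonant that follows it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PRE_BASE_VOWELS and i + 1 < len(text) and is_consonant(text[i + 1]):
            out.append(text[i + 1])  # consonant first
            out.append(ch)           # then the dependent vowel
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Route 2: one merged bounding box was trained under a single stand-in code.
# Hypothetical table: the private-use code U+E000 stands for KA + vowel AA.
MERGED_CODES = {"\uE000": "\u1780\u17B6"}


def expand_merged(text: str) -> str:
    """Replace each stand-in code with its predefined Unicode sequence."""
    for code, sequence in MERGED_CODES.items():
        text = text.replace(code, sequence)
    return text


def postprocess(raw: str) -> str:
    """Apply both routes to raw recognizer output."""
    return reorder_prebase(expand_merged(raw))
```

Which route fits depends on how many combinations exist: a replacement table stays manageable only for a handful of merged glyphs, while the reordering rule covers any consonant without needing one entry per pair.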

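The bounding-box check Dmitry asks for in the quoted message can also be done outside pyTesseractTrainer. Below is a minimal Python sketch, assuming the box file uses one line per glyph in the form `<symbol> <left> <bottom> <right> <top> <page>` with the origin at the bottom-left of the image. The coordinates in the sample are made up for illustration, and the function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines: List[str]) -> List[Box]:
    """Parse box-file lines; the trailing page number, if any, is ignored."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share some interior area."""
    return (a.left < b.right and b.left < a.right
            and a.bottom < b.top and b.bottom < a.top)


def find_overlaps(boxes: List[Box]) -> List[Tuple[Box, Box]]:
    """Return every pair of boxes that overlap (simple pairwise scan)."""
    return [(a, b)
            for i, a in enumerate(boxes)
            for b in boxes[i + 1:]
            if overlaps(a, b)]


# Made-up sample: the vowel sign's box cuts into the first consonant's box.
sample = [
    "\u1780 10 20 40 60 0",
    "\u17B7 35 55 50 70 0",
    "\u1781 60 20 90 60 0",
]
for first, second in find_overlaps(parse_box_lines(sample)):
    print("overlap:", first, second)
```

Each reported pair is a candidate for either correcting the coordinates by hand or spacing the glyphs out in the training image.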

1780





ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម 
យ រ ល វ ឝ ឞ ស ហ ឡ អ ឣ ឤ ឥ ឦ ឧ ឨ ឩ ឪ ឫ ឬ    
INDEPENDENT VOWEL
ឬ ឭ ឮ ឯ ឰ ឱ ឲ ឳ 
DEPENDENT VOWEL
឵    ិ   ិី    ឺឹ    ឺ   ុ   ូ  ួ   
TWO PART DEPENDENT VOWEL
ើ   ឿ   ៀ


1782
1783
1784
1785
1786
1787
1788
1789
178A
178B
178C
178D
178E
178F
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
179A
179B
179C
179D
17B6
17B7
17B8
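
For anyone who, like Sriranga, cannot type Khmer directly, a list of hexadecimal code points such as the one above can be turned into characters and saved as UTF-8 with a few lines of Python. This is a minimal sketch: the function names and the output path passed to `write_utf8` are hypothetical, and the short list in the example is only an excerpt of the codes above.

```python
from pathlib import Path
from typing import Iterable


def codepoints_to_text(hex_codes: Iterable[str]) -> str:
    """Convert hex code points (e.g. '1780') to space-separated characters."""
    return " ".join(chr(int(code, 16)) for code in hex_codes)


def write_utf8(path: str, text: str) -> None:
    """Save the text as UTF-8, the encoding requested for the training text."""
    Path(path).write_text(text + "\n", encoding="utf-8")


# Excerpt of the code point list: three consonants and one dependent vowel.
excerpt = ["1780", "1782", "1783", "17B6"]
line = codepoints_to_text(excerpt)
print(line)
```

Calling `write_utf8("khmer_sample.txt", line)` would then produce a text file in the requested encoding.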
