Re: Tesseract Training

Sriranga(78yrsold) Wed, 19 Jan 2011 01:26:02 -0800

Please make sure the alphabets are typed as text and not supplied as an image file.
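
If typing the characters is the obstacle, the text file could also be generated from code points. The following is only a rough sketch in Python: the ranges are my reading of the Khmer block in the Unicode chart (U+1780 to U+17FF), and the function names and output file name are made up for illustration. Please check the ranges against the chart before using the result for training.

```python
# Assumed code point ranges, taken from the Unicode Khmer block.
KHMER_RANGES = {
    "consonants": (0x1780, 0x17A2),
    "independent_vowels": (0x17A3, 0x17B3),
    "dependent_vowels": (0x17B6, 0x17C5),
    "digits": (0x17E0, 0x17E9),
}


def khmer_lines():
    """Return one line of space-separated characters per range."""
    lines = []
    for first, last in KHMER_RANGES.values():
        chars = [chr(cp) for cp in range(first, last + 1)]
        lines.append(" ".join(chars))
    return lines


def write_alphabet(path):
    """Write the characters as real text, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(khmer_lines()) + "\n")


# Example (hypothetical file name):
# write_alphabet("khmer_alphabet.txt")
```

Dependent vowels written on their own like this will usually be drawn on a dotted circle, so for real training text they should be attached to consonants.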

2011/1/19 Sriranga(78yrsold) <withblessi...@gmail.com>


> Sochenda,
> Attached is a Khmer alphabets txt file, prepared from the character map as well as
> the Unicode chart, since I am unable to type in your language even though I have
> installed the font you supplied.
> Please prepare the text (saved as UTF-8) following the attached sample txt file. I
> shall try to generate the trained data.
>
>
> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>
>> Dear Dmitry and Sriranga,
>>
>> Thank you very much for your help. Is my output file empty because I
>> assigned my own IDs to the glyphs?
>>
>> Dear Dmitry,
>> Please see the attached image. Should the glyph in the red box be assigned to
>> a single Unicode character, or separated as in the image? This glyph is composed
>> of two other glyphs: one can be represented by a Unicode character, and the
>> other is part of a vowel.
>>
>> Dear Sriranga,
>>
>> Do the first several lines in your unicharset file represent actual
>> characters, or are they just Unicode characters that stand for no character?
>>
>> Khmer font is also attached.
>>
>> Best Regards,
>>  Sochenda
>>
>>
>>
>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <daemons2...@gmail.com> wrote:
>>
>>> Dear Sochenda,
>>>
>>> In addition to what Sriranga said, I'd remind you that you should do a lot of
>>> manual work:
>>>
>>> In pyTesseractTrainer, check that no bounding box (BB) intersects a glyph; if
>>> one does, correct its BB coordinates manually.
>>>
>>> In cases of BB overlap, you should space out the participating glyphs in the
>>> training image (see the attached picture for examples).
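
Before going through the boxes by eye, a small script can list candidate pairs to look at. This is a minimal sketch in Python, assuming the box file has one entry per line in the form "glyph left bottom right top [page]" with the origin at the bottom-left corner. The function names are made up for illustration. It only compares rectangles with each other, so it is a proxy: it does not look at the pixels and cannot tell whether a box actually cuts through a neighbouring glyph.

```python
from itertools import combinations


def parse_box_line(line):
    """Parse 'glyph left bottom right top [page]' into (glyph, box, page)."""
    parts = line.split()
    left, bottom, right, top = (int(value) for value in parts[1:5])
    page = int(parts[5]) if len(parts) > 5 else 0  # page column is optional
    return parts[0], (left, bottom, right, top), page


def boxes_overlap(a, b):
    """True if two (left, bottom, right, top) boxes share any area.

    Strict comparisons mean boxes that merely touch at an edge do not count.
    """
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_overlaps(box_text):
    """Return (i, j) index pairs of entries on the same page that overlap."""
    entries = [parse_box_line(l) for l in box_text.splitlines() if l.strip()]
    hits = []
    for (i, first), (j, second) in combinations(enumerate(entries), 2):
        if first[2] == second[2] and boxes_overlap(first[1], second[1]):
            hits.append((i, j))
    return hits
```

The indices returned count non-empty entries from zero, so entry 0 is the first box in the file.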
>>>
>>> You should use manual spacing if the participating glyphs are dependent
>>> characters (in your language, vowels) and the number of possible
>>> combinations is too large to enumerate in practice. In that case, assign every
>>> glyph its own code. Tesseract will treat these glyphs as separate characters, and
>>> you should post-process the resulting code sequence to obtain a well-formed
>>> dependent Unicode pair (or triplet).
>>>
>>> If there are only a few such combinations, you can merge these BBs into
>>> one that encompasses all the required glyphs and assign a single code to the
>>> entire glyph combination. Then, during post-processing, you'll need to
>>> replace this single code with a predefined dependent Unicode pair.
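
The post-processing step described here could look roughly like the sketch below. It is in Python and rests on assumptions that are not from this thread: the placeholder codes are taken from the Unicode Private Use Area (U+E000 onwards), and the table entries are invented purely for illustration. The real table has to come from whatever codes were assigned during training.

```python
import re

# Hypothetical table: placeholder code(s) emitted by the OCR engine mapped to
# the well-formed Unicode sequence that should replace them.
REPLACEMENTS = {
    "\uE000": "\u1780\u17B6",   # one merged-glyph code -> consonant + vowel
    "\uE001\uE002": "\u17C4",   # two separately trained parts -> one vowel
}


def build_pattern(table):
    """Compile one regex matching any key, trying longer keys first."""
    keys = sorted(table, key=len, reverse=True)
    return re.compile("|".join(re.escape(key) for key in keys))


def postprocess(text, table=REPLACEMENTS):
    """Replace every placeholder sequence in a single left-to-right pass."""
    pattern = build_pattern(table)
    return pattern.sub(lambda match: table[match.group(0)], text)
```

Trying longer keys first lets a multi-code sequence win over a single code that happens to be its prefix, and the single pass means replaced text is never rescanned. The sketch does not reorder characters, so a vowel part recognised before its consonant would need extra handling.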
>>>
>>> Hope I've managed to express myself clearly.
>>>
>>> Warm regards,
>>> Dmitry Silaev
>>>
>>>
>>>  --
>>> You received this message because you are subscribed to the Google Groups
>>> "tesseract-ocr" group.
>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>> To unsubscribe from this group, send email to
>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>
>>
>
>

