Re: Tesseract Training

Sriranga(78yrsold) Sun, 16 Jan 2011 23:06:55 -0800

I looked at the Khmer Unicode chart (PDF); it does include dependent vowels.
It is better to use bbtool to generate the box file. Please see the tools
section of the wiki.


On Mon, Jan 17, 2011 at 12:24 PM, Sriranga(78yrsold) <
withblessi...@gmail.com> wrote:

> Are there dependent vowels in your Khmer language? If you have a Unicode
> chart, it would be better to upload it.
>
>
> On Mon, Jan 17, 2011 at 12:13 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>
>> I know how to do it in Tesseract; the image is just to show you how the
>> glyphs should be boxed.
>>
>> I can send you the box file generated by Tesseract anyway.
>>
>> Regards,
>>
>> Sochenda
>>
>>
>> On Mon, Jan 17, 2011 at 1:41 PM, Sriranga(78yrsold) <
>> withblessi...@gmail.com> wrote:
>>
>>> As per the wiki instructions, the box file has to be generated from the
>>> command line as follows:
>>> tesseract <lang.fontname.number.tif> <lang.fontname.number> batch.nochop makebox
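
A minimal sketch of how the command above could be assembled from a script. The file name "khm.myfont.exp0.tif" is hypothetical and only follows the lang.fontname.number pattern; the helper name makebox_command is likewise an illustration, not part of Tesseract. The output base is the image name without its extension, as in the wiki command.

```python
from pathlib import Path


def makebox_command(image_path):
    """Build the argument list for Tesseract 3.0x box-file generation."""
    image = Path(image_path)
    # The second argument is the output base: the image name minus ".tif".
    output_base = str(image.with_suffix(""))
    return ["tesseract", str(image), output_base, "batch.nochop", "makebox"]


# Hypothetical training image name, used only for illustration.
command = makebox_command("khm.myfont.exp0.tif")
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where the tesseract executable is installed.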
>>>
>>>
>>>
>>> On Mon, Jan 17, 2011 at 11:55 AM, KHEM Sochenda
>>> <khemsoche...@gmail.com> wrote:
>>>
>>>> In the image, I drew the boxes manually.
>>>>
>>>> On Mon, Jan 17, 2011 at 12:16 PM, Sriranga(78yrsold) <
>>>> withblessi...@gmail.com> wrote:
>>>>
>>>>> Which tool did you use to create the boxes? Please also upload the box
>>>>> file you generated.
>>>>>
>>>>>
>>>>> On Mon, Jan 17, 2011 at 9:31 AM, KHEM Sochenda <khemsoche...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Dear Dmitry,
>>>>>>
>>>>>> Thank you again for a very quick response.
>>>>>>
>>>>>> I am going to train Tesseract for the Khmer language, in which there are
>>>>>> many ligatures, the same case as "fi" in some Latin fonts.
>>>>>> The attachment shows an example of a one-line Khmer sentence;
>>>>>> please count the boxes from left to right. You can see that some glyphs
>>>>>> sit above others. The first glyph is formed of two Unicode characters,
>>>>>> whereas the third and the fifth glyphs together form a single Unicode
>>>>>> character. This is the reason why I wish to give each glyph its own ID
>>>>>> and then do post-processing afterward.
>>>>>>
>>>>>> Regarding two glyphs that overlap each other, as with the 7th and 8th
>>>>>> glyphs, how will Tesseract segment them? How should I give the positions
>>>>>> of the boxes?
>>>>>>
>>>>>>
>>>>>> Thank you very much in advance for your response.
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Sochenda
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev <daemons2...@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Dear Sochenda,
>>>>>>>
>>>>>>> I'm not sure what the ultimate goal of your code assignment is, but the
>>>>>>> formal answer to your question is "Yes". You can assign "k001" or
>>>>>>> "k002" to a bounding box in a .box file. Moreover, you can assign any
>>>>>>> UTF-8 encoded character sequence. In Tess version 3.0x (current) the
>>>>>>> only restriction is a 24-byte limit on the length of the entire
>>>>>>> character sequence. This also allows you to use not only an abstract
>>>>>>> code like "k001" but also a meaningful character sequence from your
>>>>>>> real language (e.g. the well-known "fi" ligature in some Latin fonts),
>>>>>>> which then relieves you from doing the pre- and post-processing.
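
A minimal sketch of checking box-file labels against the 24-byte limit mentioned above. It assumes the Tesseract 3.0x box line layout of six space-separated fields (label, left, bottom, right, top, page). The function names are hypothetical, and treating exactly 24 bytes as acceptable is an assumption, since the thread does not say whether the limit is inclusive.

```python
MAX_LABEL_BYTES = 24  # limit quoted in this thread for Tesseract 3.0x


def parse_box_line(line):
    """Split 'label left bottom right top page' into (label, [ints])."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    return parts[0], [int(p) for p in parts[1:]]


def oversized_labels(lines, limit=MAX_LABEL_BYTES):
    """Return (line_number, label, byte_length) for labels over the limit."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        label, _ = parse_box_line(line)
        size = len(label.encode("utf-8"))  # the limit is in bytes, not characters
        if size > limit:
            problems.append((number, label, size))
    return problems
```

Khmer code points take three bytes each in UTF-8, so a label of more than eight Khmer characters would exceed the limit under these assumptions.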
>>>>>>>
>>>>>>> If you still prefer using abstract codes, the pre-/post-processing
>>>>>>> can be done without tinkering with Tess's code. Since both training
>>>>>>> and recognition produce output files, you can develop a couple of
>>>>>>> file-processing command-line utilities that can then be used along
>>>>>>> with calls to the Tesseract executable within shell scripts (or .bat
>>>>>>> files on Windows).
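
A minimal sketch of the kind of post-processing utility described above. It assumes the recognized text contains the abstract codes run together (for example "k001k002") and that every code has the fixed form "k" plus three digits; both are assumptions, and the mapping table below is hypothetical.

```python
import re

# Hypothetical mapping from provisional ids to real Khmer sequences.
ID_TO_TEXT = {
    "k001": "\u1780\u17d2\u1781",  # KA + COENG + KHA
    "k002": "\u1780\u17b6",        # KA + vowel sign AA
}

# Fixed-width codes can be tokenized even when no separator is present.
CODE_PATTERN = re.compile(r"k\d{3}")


def decode_ids(text, table=ID_TO_TEXT):
    """Replace every known provisional id with its character sequence."""
    # Unknown codes are left unchanged so they stay visible for review.
    return CODE_PATTERN.sub(lambda m: table.get(m.group(0), m.group(0)), text)
```

A script like this could read the text file written by the Tesseract executable and write the decoded result, without any change to Tesseract itself.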
>>>>>>>
>>>>>>> For further details you should definitely study the
>>>>>>> "TrainingTesseract3" and "ReadMe" (section "Installation Notes -
>>>>>>> Tesseract 3.00") documents thoroughly (
>>>>>>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>>>>>>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents
>>>>>>> are not easy to search, but they contain all the info you might
>>>>>>> need.
>>>>>>>
>>>>>>> Warm regards,
>>>>>>> Dmitry Silaev
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <
>>>>>>> khemsoche...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Dear Dmitry,
>>>>>>>>
>>>>>>>> Thank you very much for a comprehensive explanation.
>>>>>>>> To get straight to the point: is it OK to assign a code like
>>>>>>>> 'k001' or 'k002' to each glyph obtained from Tesseract's segmentation?
>>>>>>>>
>>>>>>>> For the post-processing, which involves touching the Tesseract code,
>>>>>>>> could you please point out which files I should modify? Please also
>>>>>>>> advise whether the latest version of Tesseract will do.
>>>>>>>>
>>>>>>>> Thank you very much in advance for your time and your response.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>>
>>>>>>>> Sochenda
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <
>>>>>>>> daemons2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Chenda,
>>>>>>>>>
>>>>>>>>> In fact, Tesseract doesn't care whether you train it on a real
>>>>>>>>> language's letter, or which language that letter belongs to. Put
>>>>>>>>> simply, Tess only saves the mapping from feature sets obtained in
>>>>>>>>> training to Unicode ids. This implies that during training you can
>>>>>>>>> assign virtually any character code to virtually any glyph (to be
>>>>>>>>> exact, to a connected component or to a set of connected components).
>>>>>>>>>
>>>>>>>>> If your language's script comprises a reasonable number of joint
>>>>>>>>> character combinations, then while training you can assign every such
>>>>>>>>> combination a predefined Unicode id (some restrictions apply). Later,
>>>>>>>>> when running recognition, you should do some post-processing to decode
>>>>>>>>> your predefined ids into the real language's character sequences.
>>>>>>>>>
>>>>>>>>> For good results, all this requires you to develop a training-file
>>>>>>>>> pre-processor (mapping: language char combinations -> provisional ids)
>>>>>>>>> and a recognition-result post-processor (mapping: provisional ids ->
>>>>>>>>> language char sequences). I'm not sure, but this may also require
>>>>>>>>> correcting the character property bit masks in the unicharset file (I
>>>>>>>>> don't know exactly how this information is used by Tess, as I don't
>>>>>>>>> need it in my project).
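
A minimal sketch of the training-file pre-processor side of this idea: each distinct character combination found in the box labels gets a provisional id, and inverting the table gives the mapping needed for decoding later. The id scheme ("k" plus a three-digit counter) and the function names are hypothetical, and the box lines are assumed to start with the label followed by a space.

```python
def assign_ids(labels, prefix="k"):
    """Give each distinct label a provisional id, in first-seen order."""
    table = {}
    for label in labels:
        if label not in table:
            table[label] = f"{prefix}{len(table) + 1:03d}"
    return table


def encode_box_lines(lines, table):
    """Rewrite the label field of each box line using the id table."""
    encoded = []
    for line in lines:
        label, rest = line.split(" ", 1)  # label is the first field
        encoded.append(f"{table[label]} {rest}")
    return encoded


def invert(table):
    """Build the id -> character sequence mapping for later decoding."""
    return {code: label for label, code in table.items()}
```

Keeping both directions in one table means the pre-processor and the post-processor cannot drift apart.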
>>>>>>>>>
>>>>>>>>> Warm regards,
>>>>>>>>> Dmitry Silaev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <
>>>>>>>>> khemsoche...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Tesseract Team,
>>>>>>>>>>
>>>>>>>>>> In the step of training a new language, we have to assign a Unicode
>>>>>>>>>> value to each box.
>>>>>>>>>> I would like to know what to do if a shape is composed of several
>>>>>>>>>> Unicode characters.
>>>>>>>>>> Is there any way to assign only an ID to each box in Tesseract?
>>>>>>>>>>
>>>>>>>>>> Thank you very much in advance for your response.
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Chenda
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  --
>>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>>> To post to this group, send email to
>>>>>>>>>> tesseract-ocr@googlegroups.com.
>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>>>> For more options, visit this group at
>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>>