Are there dependent vowels in your Khmer language? If you have a Unicode chart,
it would be better to upload it.
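
In case a chart is not at hand, here is a minimal sketch that lists the Khmer dependent vowels from Python's bundled Unicode database. It assumes that the characters of interest are exactly those in the Khmer block (U+1780..U+17FF) whose official name contains "VOWEL SIGN"; the function name is only illustrative.

```python
import unicodedata

# Khmer block in Unicode: U+1780..U+17FF.
KHMER_BLOCK = range(0x1780, 0x1800)


def khmer_dependent_vowels():
    """Return (code point, name) pairs for Khmer characters named as vowel signs."""
    result = []
    for cp in KHMER_BLOCK:
        # Unassigned code points have no name; the default "" skips them.
        name = unicodedata.name(chr(cp), "")
        if "VOWEL SIGN" in name:
            result.append((cp, name))
    return result


if __name__ == "__main__":
    for cp, name in khmer_dependent_vowels():
        print(f"U+{cp:04X}  {name}")
```

The output can serve as a starting chart; other combining signs in the block are not covered by this filter.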

On Mon, Jan 17, 2011 at 12:13 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:

> I know how to do it in Tesseract; the image is just to show you how the
> glyphs should be boxed.
>
> I can send you the box file generated by Tesseract anyway.
>
> Regards,
>
> Sochenda
>
>
> On Mon, Jan 17, 2011 at 1:41 PM, Sriranga(78yrsold) <
> withblessi...@gmail.com> wrote:
>
>> As per the wiki instructions, the box file has to be generated from the
>> command line as follows:
>> tesseract <lang.fontname.number.tif> <lang.fontname.number> batch.nochop makebox
>>
>>
>>
>> On Mon, Jan 17, 2011 at 11:55 AM, KHEM Sochenda
>> <khemsoche...@gmail.com> wrote:
>>
>>> In the image, I drew the boxes manually.
>>>
>>> On Mon, Jan 17, 2011 at 12:16 PM, Sriranga(78yrsold) <
>>> withblessi...@gmail.com> wrote:
>>>
>>>> Which tool did you use to create the boxes? Please also upload the box
>>>> file you generated.
>>>>
>>>>
>>>> On Mon, Jan 17, 2011 at 9:31 AM, KHEM Sochenda
>>>> <khemsoche...@gmail.com> wrote:
>>>>
>>>>> Dear Dmitry,
>>>>>
>>>>> Thank you again for a very quick response.
>>>>>
>>>>> I am going to train Tesseract for the Khmer language, in which there are
>>>>> many ligatures that behave like "fi" in some Latin fonts.
>>>>> The attachment shows an example of a one-line Khmer sentence; please count
>>>>> the boxes from left to right. You can see that some glyphs sit above
>>>>> others. The first glyph is formed of two Unicode characters, whereas the
>>>>> third and fifth glyphs together form a single Unicode character. This is
>>>>> the reason why I wish to give each glyph its own ID and then do the
>>>>> post-processing afterward.
>>>>>
>>>>> Regarding two glyphs that overlap each other, as in the case of the 7th
>>>>> and 8th glyphs, how will Tesseract segment them? How should I give the
>>>>> positions of the boxes?
>>>>>
>>>>>
>>>>> Thank you very much in advance for your response.
>>>>>
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Sochenda
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev
>>>>> <daemons2...@gmail.com> wrote:
>>>>>
>>>>>> Dear Sochenda,
>>>>>>
>>>>>> I'm not sure what the ultimate goal of your code assignment is, but the
>>>>>> formal answer to your question is "Yes". You can assign "k001" or "k002"
>>>>>> to a bounding box in a .box file. Moreover, you can assign any UTF-8
>>>>>> encoded character sequence. In Tess version 3.0x (current) the only
>>>>>> restriction is a 24-byte limit on the length of the entire character
>>>>>> sequence. This allows you to use not only an abstract code like "k001"
>>>>>> but also a meaningful character sequence from your real language (e.g.
>>>>>> the well-known "fi" ligature in some Latin fonts), which relieves you
>>>>>> of the pre- and post-processing.
>>>>>>
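
The 24-byte limit mentioned above can be checked before training. The following is only a sketch: it assumes each box-file line has the layout `<label> <left> <bottom> <right> <top> [<page>]` separated by single spaces, and it assumes that a label of exactly 24 bytes is still acceptable. The function names are hypothetical and not part of Tesseract.

```python
MAX_LABEL_BYTES = 24  # limit quoted in the thread for Tesseract 3.0x


def parse_box_line(line):
    """Split one box-file line into (label, numbers).

    Assumed layout: <label> <left> <bottom> <right> <top> [<page>],
    separated by single spaces; the label itself contains no spaces.
    """
    label, *fields = line.rstrip("\n").split(" ")
    return label, [int(f) for f in fields]


def oversized_labels(lines):
    """Return the labels whose UTF-8 encoding exceeds MAX_LABEL_BYTES."""
    bad = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        label, _ = parse_box_line(line)
        if len(label.encode("utf-8")) > MAX_LABEL_BYTES:
            bad.append(label)
    return bad
```

Note that each Khmer character takes 3 bytes in UTF-8, so under this limit a label holds at most 8 Khmer code points.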
>>>>>> If you still prefer using abstract codes, the pre-/post-processing can
>>>>>> be done without tinkering with Tess's code. Since both training and
>>>>>> recognition generate output files, you can develop a couple of
>>>>>> file-processing command-line utilities and call them alongside the
>>>>>> Tesseract executable from shell scripts (or .bat files on Windows).
>>>>>>
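
A post-processing utility of the kind described above might look like the sketch below. The code table is purely hypothetical (the real one depends on which glyph each code was assigned to during training), and the `k` plus three digits pattern is assumed from the "k001" example.

```python
import re

# Hypothetical table: abstract code -> real Unicode sequence.
CODE_TO_TEXT = {
    "k001": "\u1780\u17B6",  # KHMER LETTER KA + KHMER VOWEL SIGN AA
    "k002": "\u1781",        # KHMER LETTER KHA
}

# Assumed code shape: the letter "k" followed by exactly three digits.
CODE_PATTERN = re.compile(r"k\d{3}")


def decode_codes(text, table=CODE_TO_TEXT):
    """Replace every known abstract code in text; leave unknown codes untouched."""
    return CODE_PATTERN.sub(lambda m: table.get(m.group(0), m.group(0)), text)
```

Applied to the text file written by the recognition run, `decode_codes` would turn a line such as `k001k002` into the corresponding Khmer sequence; unknown codes are left as they are so that gaps in the table stay visible.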
>>>>>> For further details you should definitely study the
>>>>>> "TrainingTesseract3" and "ReadMe" (section "Installation Notes -
>>>>>> Tesseract 3.00") documents thoroughly (
>>>>>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>>>>>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are
>>>>>> not easy to search, but they contain all the info you might need.
>>>>>>
>>>>>> Warm regards,
>>>>>> Dmitry Silaev
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <
>>>>>> khemsoche...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Dear Dmitry,
>>>>>>>
>>>>>>> Thank you very much for a comprehensive explanation.
>>>>>>> To go straight to the point: is it OK to assign a code like 'k001' or
>>>>>>> 'k002' to each glyph obtained from Tesseract's segmentation?
>>>>>>>
>>>>>>> For the post-processing, which would mean touching Tesseract's code,
>>>>>>> could you please point out which files I should modify? Please also
>>>>>>> advise me whether the latest version of Tesseract will do fine.
>>>>>>>
>>>>>>> Thank you very much in advance for your time and response back.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Sochenda
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <
>>>>>>> daemons2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Chenda,
>>>>>>>>
>>>>>>>> In fact Tesseract doesn't care whether you train it on a letter of a
>>>>>>>> real language, or which language that letter belongs to. Put simply,
>>>>>>>> Tess only saves the mapping from the feature sets obtained in training
>>>>>>>> to Unicode ids. This implies that during training you can assign
>>>>>>>> virtually any character code to virtually any glyph (to be exact, to a
>>>>>>>> connected component or to a set of connected components).
>>>>>>>>
>>>>>>>> If your language's script comprises a reasonable number of joined
>>>>>>>> character combinations, then while training you can assign every such
>>>>>>>> combination a predefined Unicode id (some restrictions apply). Later,
>>>>>>>> when running recognition, you should do some post-processing to decode
>>>>>>>> your predefined ids into the real language's character sequences.
>>>>>>>>
>>>>>>>> For good results all this requires you to develop a training file
>>>>>>>> pre-processor (mapping: language char combinations -> provisional ids)
>>>>>>>> and a recognition result post-processor (mapping: provisional ids ->
>>>>>>>> language char sequences). I'm not sure, but this may also require
>>>>>>>> correcting the character property bit masks in the unicharset file (I
>>>>>>>> don't know exactly how Tess uses this information, as I don't need it
>>>>>>>> in my project).
>>>>>>>>
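
The two mappings described above are inverses of each other, so both can be built in one pass. This is a sketch under assumptions: the id format (`k` plus a three-digit counter) and the function name are hypothetical, and the input is taken to be the list of character sequences that label the boxes.

```python
def build_code_tables(sequences, prefix="k"):
    """Assign a provisional id to each distinct character sequence.

    Ids are numbered in order of first appearance: k001, k002, ...
    Returns (encode, decode) dictionaries that are inverses of each other.
    """
    encode = {}
    for seq in sequences:
        if seq not in encode:
            encode[seq] = f"{prefix}{len(encode) + 1:03d}"
    decode = {code: seq for seq, code in encode.items()}
    return encode, decode
```

The `encode` table would drive the training-file pre-processor and `decode` the recognition-result post-processor; storing both in one file keeps the two steps consistent.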
>>>>>>>> Warm regards,
>>>>>>>> Dmitry Silaev
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <
>>>>>>>> khemsoche...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Dear Tesseract Team,
>>>>>>>>>
>>>>>>>>> In the step of training a new language, we have to assign a Unicode
>>>>>>>>> value to each box.
>>>>>>>>> I would like to know what to do if a shape is composed of several
>>>>>>>>> Unicode characters.
>>>>>>>>> Is there any way to assign only an id to each box in Tesseract?
>>>>>>>>>
>>>>>>>>> Thank you very much in advance for your response.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Chenda
>>>>>>>>>
>>>>>>>>>  --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To post to this group, send email to
>>>>>>>>> tesseract-ocr@googlegroups.com.
>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>>> For more options, visit this group at
>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>
