This link leads to the Khmer Unicode chart:
http://unicode.org/charts/PDF/U1780.pdf
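In case it helps with the dependent-vowel question below, here is a minimal Python sketch that lists the dependent vowel signs in the Khmer block straight from the Unicode character database. It assumes that the block in the chart spans U+1780..U+17FF and that the dependent vowels are exactly the characters whose names start with "KHMER VOWEL SIGN"; the helper name is my own.

```python
import unicodedata

# Assumption: the Khmer block shown in U1780.pdf spans U+1780..U+17FF.
KHMER_BLOCK = range(0x1780, 0x1800)


def khmer_dependent_vowels():
    """Return (code point, name) pairs for the Khmer dependent vowel signs."""
    found = []
    for cp in KHMER_BLOCK:
        # Unassigned code points have no name; fall back to an empty string.
        name = unicodedata.name(chr(cp), "")
        if name.startswith("KHMER VOWEL SIGN"):
            found.append((cp, name))
    return found


for cp, name in khmer_dependent_vowels():
    print(f"U+{cp:04X}  {name}")
```

Each of these signs attaches to a base consonant, which is why one boxed glyph can correspond to more than one Unicode character.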

On Mon, Jan 17, 2011 at 2:06 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:

> I looked at the Khmer Unicode chart (PDF), and it does contain dependent
> vowels. It is better to use bbtool to generate the box file. Please see the
> wiki section on tools.
>
>
> On Mon, Jan 17, 2011 at 12:24 PM, Sriranga(78yrsold) <
> withblessi...@gmail.com> wrote:
>
>> Are there dependent vowels in the Khmer language? If you have a Unicode
>> chart, it would be better to upload it.
>>
>>
>> On Mon, Jan 17, 2011 at 12:13 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>
>>> I know how to do it in Tesseract; the image is just to show you how the
>>> glyphs should be boxed.
>>>
>>> I can send you the box file generated by Tesseract anyway.
>>>
>>> Regards,
>>>
>>> Sochenda
>>>
>>>
>>> On Mon, Jan 17, 2011 at 1:41 PM, Sriranga(78yrsold) <
>>> withblessi...@gmail.com> wrote:
>>>
>>>> As per the wiki instructions, the box file has to be generated from the
>>>> command line, as follows:
>>>> tesseract <lang.fontname.number.tif> <lang.fontname.number>
>>>> batch.nochop makebox
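For anyone scripting this step, a small Python sketch of how the command above could be assembled and run. The function names and the sample base name "khm.myfont.exp0" are made up for illustration; the only part taken from this thread is the argument order of the tesseract call.

```python
import subprocess


def makebox_command(base):
    """Build the argument list for the makebox call quoted above.

    `base` is the image name without its extension, for example the
    hypothetical "khm.myfont.exp0" for an image khm.myfont.exp0.tif.
    """
    return ["tesseract", base + ".tif", base, "batch.nochop", "makebox"]


def run_makebox(base):
    # Needs the tesseract executable on PATH; raises CalledProcessError
    # if tesseract exits with a non-zero status.
    subprocess.run(makebox_command(base), check=True)


print(" ".join(makebox_command("khm.myfont.exp0")))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.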
>>>>
>>>>
>>>>
>>>> On Mon, Jan 17, 2011 at 11:55 AM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>>>
>>>>> I drew the boxes in the image manually.
>>>>>
>>>>> On Mon, Jan 17, 2011 at 12:16 PM, Sriranga(78yrsold) <
>>>>> withblessi...@gmail.com> wrote:
>>>>>
>>>>>> Which tool did you use to create the boxes? Please also upload the box
>>>>>> file you generated.
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 17, 2011 at 9:31 AM, KHEM Sochenda <
>>>>>> khemsoche...@gmail.com> wrote:
>>>>>>
>>>>>>> Dear Dmitry,
>>>>>>>
>>>>>>> Thank you again for a very quick response.
>>>>>>>
>>>>>>> I am going to train Tesseract for the Khmer language, which has many
>>>>>>> ligatures of the same kind as "fi" in some Latin fonts.
>>>>>>> The attachment shows an example of a one-line Khmer sentence; please
>>>>>>> count the boxes from left to right. You can see that some glyphs sit
>>>>>>> above others. The first glyph is formed of two Unicode characters,
>>>>>>> whereas the third and the fifth glyphs together form a Unicode
>>>>>>> character. This is why I wish to give each glyph its own ID and do the
>>>>>>> post-processing afterward.
>>>>>>>
>>>>>>> Regarding two glyphs that overlap each other, as the 7th and 8th
>>>>>>> glyphs do: how will Tesseract segment them, and how should I give the
>>>>>>> positions of their boxes?
>>>>>>>
>>>>>>>
>>>>>>> Thank you very much in advance for your response.
>>>>>>>
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Sochenda
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev <
>>>>>>> daemons2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Dear Sochenda,
>>>>>>>>
>>>>>>>> I'm not sure what the ultimate goal of your code assignment is, but the
>>>>>>>> formal answer to your question is "Yes". You can assign "k001" or
>>>>>>>> "k002" to a bounding box in a .box file. Moreover, you can assign any
>>>>>>>> UTF-8 encoded character sequence. In Tess version 3.0x (current) the
>>>>>>>> only restriction is a 24-byte limit on the length of the entire
>>>>>>>> character sequence. This allows you to use not only an abstract code
>>>>>>>> like "k001" but also a meaningful character sequence from your real
>>>>>>>> language (e.g. the well-known "fi" ligature in some Latin fonts),
>>>>>>>> which relieves you of the pre- and post-processing.
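To make the byte limit concrete, here is a rough Python check of whether a candidate box label fits. It assumes the limit means "at most 24 bytes of UTF-8" (whether exactly 24 bytes is still accepted is my assumption, not something confirmed in this thread), and the function name is invented for the example.

```python
MAX_LABEL_BYTES = 24  # limit mentioned above for Tesseract 3.0x


def label_fits(label, limit=MAX_LABEL_BYTES):
    """True if the UTF-8 encoding of `label` is at most `limit` bytes.

    Assumption: a label of exactly `limit` bytes is still accepted.
    """
    return len(label.encode("utf-8")) <= limit


# An abstract code is plain ASCII, one byte per character.
print(label_fits("k001"))

# Khmer code points lie in U+1780..U+17FF and take 3 bytes each in UTF-8,
# so under this assumption a label holds at most 8 of them.
print(label_fits("\u1780" * 8))
print(label_fits("\u1780" * 9))
```

So a real Khmer sequence as a label is fine for short clusters, while very long clusters would still need an abstract code.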
>>>>>>>>
>>>>>>>> If you still prefer abstract codes, the pre-/post-processing can be
>>>>>>>> done without tinkering with Tess's code. Since both training and
>>>>>>>> recognition produce output files, you can develop a couple of
>>>>>>>> command-line file-processing utilities and call them alongside the
>>>>>>>> Tesseract executable from shell scripts (or .bat files on Windows).
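As an illustration of such a utility, a minimal Python sketch of a training-side pre-processor that swaps real character sequences for abstract codes in box-file lines. It assumes each box line starts with the label followed by a space and the coordinates; the mapping table, the sample coordinates, and the function name are all hypothetical.

```python
def relabel_box_lines(lines, mapping):
    """Replace the leading label of each box line using `mapping`.

    Assumption: a box line is "<label> <numbers...>", so everything up to
    the first space is the label. Labels not in `mapping` are left alone.
    """
    out = []
    for line in lines:
        label, sep, rest = line.rstrip("\n").partition(" ")
        out.append(mapping.get(label, label) + sep + rest)
    return out


# Hypothetical table: one Khmer cluster (KA + COENG + KHA) -> abstract code.
TEXT_TO_ID = {"\u1780\u17D2\u1781": "k001"}

# Made-up box lines, for illustration only.
sample = ["\u1780\u17D2\u1781 10 20 40 60 0", "a 50 20 70 60 0"]
for line in relabel_box_lines(sample, TEXT_TO_ID):
    print(line)
```

A script like this can run between the makebox step and the rest of the training commands, leaving the Tesseract sources untouched.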
>>>>>>>>
>>>>>>>> For further details you should definitely study the
>>>>>>>> "TrainingTesseract3" and "ReadMe" (section "Installation Notes -
>>>>>>>> Tesseract 3.00") documents thoroughly (
>>>>>>>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>>>>>>>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents
>>>>>>>> are not easy to search, but they contain all the info you might
>>>>>>>> need.
>>>>>>>>
>>>>>>>> Warm regards,
>>>>>>>> Dmitry Silaev
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <
>>>>>>>> khemsoche...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dear Dmitry,
>>>>>>>>>
>>>>>>>>> Thank you very much for a comprehensive explanation.
>>>>>>>>> To go straight to the point: is it OK to assign a code like 'k001' or
>>>>>>>>> 'k002' to each glyph obtained from Tesseract's segmentation?
>>>>>>>>>
>>>>>>>>> For the post-processing, which would mean touching the Tesseract code,
>>>>>>>>> could you please point out which files I should modify? Please also
>>>>>>>>> advise me whether the latest version of Tesseract will do.
>>>>>>>>>
>>>>>>>>> Thank you very much in advance for your time and response back.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>>
>>>>>>>>> Sochenda
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <
>>>>>>>>> daemons2...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Chenda,
>>>>>>>>>>
>>>>>>>>>> In fact, Tesseract doesn't care whether you train it on a letter of a
>>>>>>>>>> real language, or which language that letter belongs to. Put simply,
>>>>>>>>>> Tess only saves the mapping from the feature sets obtained in
>>>>>>>>>> training to Unicode ids. This implies that during training you can
>>>>>>>>>> assign virtually any character code to virtually any glyph (to be
>>>>>>>>>> exact, to a connected component or to a set of connected components).
>>>>>>>>>>
>>>>>>>>>> If your language's script comprises a reasonable number of joint
>>>>>>>>>> character combinations, then while training you can assign every such
>>>>>>>>>> combination a predefined Unicode id (some restrictions apply). Later,
>>>>>>>>>> when running recognition, you should do some post-processing to decode
>>>>>>>>>> your predefined ids into the real language's character sequences.
>>>>>>>>>>
>>>>>>>>>> For good results, all this requires you to develop a training file
>>>>>>>>>> pre-processor (mapping: language char combinations -> provisional ids)
>>>>>>>>>> and a recognition result post-processor (mapping: provisional ids ->
>>>>>>>>>> language char sequences). I'm not sure, but this may also require
>>>>>>>>>> correcting the character property bit masks in the unicharset file (I
>>>>>>>>>> don't know exactly how this information is used by Tess, as I don't
>>>>>>>>>> need it in my project).
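The recognition-side half of this idea could look roughly like the Python sketch below, which decodes provisional ids in recognized text back into character sequences. The id format ("k" plus three digits), the table contents, and the function name are assumptions chosen for illustration, not anything Tesseract prescribes.

```python
import re

# Hypothetical table: provisional id -> real Khmer character sequence.
ID_TO_TEXT = {
    "k001": "\u1780\u17D2\u1781",  # KA + COENG + KHA
    "k002": "\u1780\u17B6",        # KA + VOWEL SIGN AA
}

# Assumption: every provisional id is "k" followed by exactly three digits.
ID_PATTERN = re.compile(r"k\d{3}")


def decode(recognized):
    """Replace each known provisional id with its character sequence.

    Ids missing from the table are left as they are, so they stay visible
    in the output instead of silently disappearing.
    """
    return ID_PATTERN.sub(
        lambda m: ID_TO_TEXT.get(m.group(0), m.group(0)), recognized
    )


print(decode("k001k002 k999"))
```

One caveat with this design: if genuine text could ever contain something shaped like an id, the ids would need a prefix that cannot occur in the language being recognized.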
>>>>>>>>>>
>>>>>>>>>> Warm regards,
>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <
>>>>>>>>>> khemsoche...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear Tesseract Team,
>>>>>>>>>>>
>>>>>>>>>>> In the step of training a new language, we have to assign a Unicode
>>>>>>>>>>> value to each box.
>>>>>>>>>>> I would like to know what to do if a shape is composed of several
>>>>>>>>>>> Unicode characters.
>>>>>>>>>>> Is there any way to assign just an ID to each box in Tesseract?
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much in advance for your response.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Chenda
>>>>>>>>>>>
>>>>>>>>>>>  --
>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>> tesseract-ocr@googlegroups.com.
>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>>>
