I am using Windows XP, and occasionally CentOS.

On Mon, Jan 17, 2011 at 2:16 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:
> From the PDF I can see that there are a number of dependent vowels. The
> case is similar to Indic languages.
> Let me know which OS you are using.
>
> On Mon, Jan 17, 2011 at 12:42 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>
>> This link leads to the Khmer Unicode chart:
>> http://unicode.org/charts/PDF/U1780.pdf
>>
>> On Mon, Jan 17, 2011 at 2:06 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:
>>
>>> I viewed the Khmer Unicode chart (PDF); there are dependent vowels.
>>> It is better to use bbtool to generate the box file. Please see the wiki
>>> section on tools.
>>>
>>> On Mon, Jan 17, 2011 at 12:24 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:
>>>
>>>> Are there dependent vowels in your Khmer language? If you have a Unicode
>>>> chart, it would be better to upload it.
>>>>
>>>> On Mon, Jan 17, 2011 at 12:13 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>>>
>>>>> I know how to do it in tesseract; the image is just to show you how
>>>>> the glyphs should be boxed.
>>>>>
>>>>> I can send you the box file generated by tesseract anyway.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Sochenda
>>>>>
>>>>> On Mon, Jan 17, 2011 at 1:41 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:
>>>>>
>>>>>> As per the wiki instructions, the command line has to be used to
>>>>>> generate the box file, as follows:
>>>>>> tesseract <lang.fontname.number.tif> <lang.fontname.number> batch.nochop makebox
>>>>>>
>>>>>> On Mon, Jan 17, 2011 at 11:55 AM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>>>>>
>>>>>>> In the image, I did it manually.
>>>>>>>
>>>>>>> On Mon, Jan 17, 2011 at 12:16 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Which tool did you use to create the boxes? Please also upload the
>>>>>>>> box file you generated.
>>>>>>>>
>>>>>>>> On Mon, Jan 17, 2011 at 9:31 AM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Dear Dmitry,
>>>>>>>>>
>>>>>>>>> Thank you again for a very quick response.
>>>>>>>>>
>>>>>>>>> I am going to train tesseract for the Khmer language, in which there
>>>>>>>>> are many ligatures, much like the "fi" ligature in some Latin fonts.
>>>>>>>>> The attachment shows an example of a one-line Khmer sentence; please
>>>>>>>>> count the boxes from left to right. You can see that some glyphs sit
>>>>>>>>> above others. The first glyph is formed of two Unicode characters,
>>>>>>>>> while the third glyph and the fifth glyph together form one Unicode
>>>>>>>>> character. This is the reason why I wish to give each glyph its own
>>>>>>>>> ID and then do post-processing afterward.
>>>>>>>>>
>>>>>>>>> Regarding two glyphs that overlap each other, as with the 7th and
>>>>>>>>> 8th glyphs, how will tesseract segment them? How should the
>>>>>>>>> positions of the boxes be given?
>>>>>>>>>
>>>>>>>>> Thank you very much in advance for your response.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>>
>>>>>>>>> Sochenda
>>>>>>>>>
>>>>>>>>> On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev <daemons2...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>
>>>>>>>>>> I'm not sure what the ultimate goal of your code assignment is, but
>>>>>>>>>> the formal answer to your question is "Yes". You can assign "k001"
>>>>>>>>>> or "k002" to a bounding box in a .box file. Moreover, you can
>>>>>>>>>> assign any UTF-8 encoded character sequence. In Tess version 3.0x
>>>>>>>>>> (current) the only restriction is a 24-byte limit for the entire
>>>>>>>>>> character sequence length.
>>>>>>>>>> This also allows you to use not only an abstract code like "k001"
>>>>>>>>>> but also a meaningful character sequence from your real language
>>>>>>>>>> (e.g. the well-known "fi" ligature in some Latin fonts), which then
>>>>>>>>>> relieves you from pre- and post-processing.
>>>>>>>>>>
>>>>>>>>>> If you still prefer abstract codes, then pre-/post-processing can
>>>>>>>>>> be done without tinkering with Tess's code. Since training as well
>>>>>>>>>> as recognition results in the generation of output files, you can
>>>>>>>>>> develop a couple of file-processing command-line utilities, which
>>>>>>>>>> can then be used along with calls to the Tesseract executable
>>>>>>>>>> within shell scripts (or .bat files on Windows).
>>>>>>>>>>
>>>>>>>>>> For further details you should definitely study the
>>>>>>>>>> "TrainingTesseract3" and "ReadMe" (section "Installation Notes -
>>>>>>>>>> Tesseract 3.00") documents thoroughly
>>>>>>>>>> (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>>>>>>>>>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These
>>>>>>>>>> documents are not easy to search, but they contain all the info
>>>>>>>>>> you might need.
>>>>>>>>>>
>>>>>>>>>> Warm regards,
>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>
>>>>>>>>>> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much for a comprehensive explanation.
>>>>>>>>>>> To go straight to the point: does it sound OK to assign a code
>>>>>>>>>>> like 'k001' or 'k002' to each glyph obtained from tesseract
>>>>>>>>>>> segmentation?
>>>>>>>>>>>
>>>>>>>>>>> For post-processing that touches the tesseract code, could you
>>>>>>>>>>> please point out which files I should modify?
>>>>>>>>>>> Please advise me whether the latest version of tesseract will do.
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much in advance for your time and response.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>>
>>>>>>>>>>> Sochenda
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <daemons2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Chenda,
>>>>>>>>>>>>
>>>>>>>>>>>> In fact Tesseract doesn't care whether you train it on a real
>>>>>>>>>>>> language's letter, or which language that letter belongs to.
>>>>>>>>>>>> Put simply, Tess only saves the mapping from feature sets
>>>>>>>>>>>> obtained in training to Unicode ids. This implies that during
>>>>>>>>>>>> training you can assign virtually any character code to
>>>>>>>>>>>> virtually any glyph (to be exact, to a connected component or to
>>>>>>>>>>>> a set of connected components).
>>>>>>>>>>>>
>>>>>>>>>>>> If your language's script comprises a reasonable number of
>>>>>>>>>>>> joined character combinations, then while training you can
>>>>>>>>>>>> assign every such combination a predefined Unicode id (some
>>>>>>>>>>>> restrictions apply). Later, when running recognition, you should
>>>>>>>>>>>> do some post-processing to decode your predefined ids into the
>>>>>>>>>>>> real language's character sequences.
>>>>>>>>>>>>
>>>>>>>>>>>> For good results, all this requires you to develop a training
>>>>>>>>>>>> file pre-processor (mapping: language char combinations ->
>>>>>>>>>>>> provisional ids) and a recognition result post-processor
>>>>>>>>>>>> (mapping: provisional ids -> language char sequences). I'm not
>>>>>>>>>>>> sure, but this may also require correcting the character
>>>>>>>>>>>> property bit masks in the unicharset file (I don't know exactly
>>>>>>>>>>>> how this information is used by Tess, as I don't need it in my
>>>>>>>>>>>> project).
>>>>>>>>>>>>
>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Tesseract Team,
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the step for training a new language, we have to assign a
>>>>>>>>>>>>> Unicode value to each box.
>>>>>>>>>>>>> I would like to know how to handle a shape that is composed of
>>>>>>>>>>>>> several Unicode characters.
>>>>>>>>>>>>> Is there any way to assign only an id to each box in tesseract?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much in advance for your response.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Chenda
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>> tesseract-ocr@googlegroups.com.
>>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
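As a rough illustration of the recognition-result post-processor Dmitry describes in the quoted thread (provisional ids -> language char sequences), here is a minimal Python sketch. The ids "k001" and "k002" come from the thread, but the Khmer sequences they map to, the table name ID_TO_TEXT, and both helper functions are hypothetical and chosen only for illustration. The 24-byte figure is the limit quoted in the thread for Tesseract 3.0x; treating a label of exactly 24 bytes as acceptable is an assumption.

```python
import re

# Hypothetical table: provisional id -> Khmer character sequence.
# The pairings below are illustrative only, not taken from the thread.
ID_TO_TEXT = {
    "k001": "\u1780\u17B6",  # KHMER LETTER KA + KHMER VOWEL SIGN AA
    "k002": "\u1781\u17B7",  # KHMER LETTER KHA + KHMER VOWEL SIGN I
}

# Limit quoted in the thread for Tesseract 3.0x box labels.
# Assumption: a label of exactly 24 bytes is still allowed.
MAX_LABEL_BYTES = 24


def check_labels(mapping):
    """Return the ids whose UTF-8 encoding is longer than MAX_LABEL_BYTES."""
    return [k for k in mapping if len(k.encode("utf-8")) > MAX_LABEL_BYTES]


def decode(text, mapping):
    """Replace every provisional id found in text with its character sequence."""
    if not mapping:
        return text
    # Try longer ids first so a short id never matches inside a longer one.
    ids = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ids))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


if __name__ == "__main__":
    too_long = check_labels(ID_TO_TEXT)
    if too_long:
        raise SystemExit("labels over the size limit: %s" % too_long)
    # Text as it might come out of recognition when boxes were labelled with ids.
    print(decode("k001k002 k001", ID_TO_TEXT))
```

A script like this could be called from a shell script or .bat file right after the Tesseract executable, reading the recognised text file and writing the decoded one. The inverse table (character sequence -> id) would serve the training-file pre-processor in the same way.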