---------- Forwarded message ----------
From: Sriranga(78yrsold) <withblessi...@gmail.com>
Date: Fri, Jan 21, 2011 at 12:33 PM
Subject: Re: Tesseract Training
To: KHEM Sochenda <khemsoche...@gmail.com>

Chenda,

It is better to type the characters themselves (in your language's
script) in the box file than to type their codes, because your
characters will then be found in the unicharset file. I don't know
whether your keyboard can type your language, but if it can, it is
better to type.

On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:

> Chenda,
> By guesswork I have edited the box file using another tool, owler.exe
> (which is for English only), attached herewith. The advantage of the
> attached owler.exe is that you can type a character or its hexadecimal
> code by pressing Tab.
> A consonant or an independent vowel may each have a *single box*, but
> a consonant or independent vowel + *dependent vowel* combination must
> have a single box. (The owler tool is not suitable for Kannada, so I
> do not use it myself.)
> Output from the same TIF file that was used for training should
> naturally be displayed correctly. Output from any other TIF file will
> naturally contain misspellings, which can be corrected by
> post-processing software. The same problem occurred for Kannada. I
> hope you will succeed in generating the traineddata file, since there
> is no script more complex than Kannada.
> After I receive the corrected box file, I shall generate the
> traineddata file.
>
> With Best Wishes,
> -sriranga(78yrs)
>
> On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>
>> Dear Dmitry and Sriranga,
>>
>> Here are the results of my training. When I ran recognition on the
>> same image that was used for training, the result was perfect. When
>> I tried the attached test image, there seemed to be problems
>> recognizing the characters.
>>
>> Please tell me your thoughts about this.
>>
>> Best Regards,
>>
>> Sochenda
>>
>> On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>
>>> Dear Sriranga,
>>>
>>> Here is my training box file. Editing a box file is really tedious.
>>> I just found some glyphs that I haven't assigned codes to yet, but
>>> they are difficult to find in the box editor you gave me, and also
>>> in pytesseracttrainer.py, as it is too slow.
>>>
>>> Best Regards,
>>>
>>> Sochenda
>>>
>>> On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:
>>>
>>>> Box file for editing.
>>>>
>>>> On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>>>
>>>>> Dear Dmitry and Sriranga,
>>>>>
>>>>> But, Sriranga, I guess your computer cannot render the Khmer (KH)
>>>>> language well. I will send you an image instead, OK?
>>>>>
>>>>> Best Regards,
>>>>> Sochenda
>>>>>
>>>>> On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:
>>>>>
>>>>>> Attached is a zip file containing the owler exe file. Before
>>>>>> unzipping, please delete the word {"OM"} first, and then unzip.
>>>>>> With the help of owler, edit the box file according to your
>>>>>> requirements. After editing the box file, please forward it to
>>>>>> me so that I can generate the traineddata file; or, if you are
>>>>>> able to generate the traineddata file, you can do it yourself -
>>>>>> no problem.
>>>>>> With best of Luck,
>>>>>> -sriranga(78yrs)
>>>>>> Dear Dmitry,
>>>>>> Sorry, I could not post in the forum because of the attached
>>>>>> files. Hence I am sending a copy to you.
>>>>>>
>>>>>> On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:
>>>>>>
>>>>>>> Sochenda,
>>>>>>> Please find attached the box file with its khtext.png file for
>>>>>>> editing. I am sending khtext.tif and the owler tool to you
>>>>>>> separately for your editing, since I don't know the Khmer
>>>>>>> language and am unable to type it on the keyboard. After
>>>>>>> editing the box file, please return it to me for further
>>>>>>> processing.
>>>>>>>
>>>>>>> With best of Luck,
>>>>>>> -sriranga(78yrs)
>>>>>>>
>>>>>>> 2011/1/20 KHEM Sochenda <khemsoche...@gmail.com>
>>>>>>>
>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>
>>>>>>>> I am so confused now. :(
>>>>>>>>
>>>>>>>> Maybe I should apply for an internship with tesseract, but I
>>>>>>>> am so engaged with my project here.
>>>>>>>>
>>>>>>>> Please find attached KHtext, in Unicode, as a training sample.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>>
>>>>>>>> Sochenda
>>>>>>>>
>>>>>>>> 2011/1/19 Sriranga(78yrsold) <withblessi...@gmail.com>
>>>>>>>>
>>>>>>>>> Sochenda,
>>>>>>>>> *The output for the lines viz. 0ccb 8, 0cd5 8, 20c88 appears
>>>>>>>>> in vowel1.txt. So we have to convert the Unicode numbers to
>>>>>>>>> Kannada characters (script) with the help of a
>>>>>>>>> post-processor.*
>>>>>>>>> -Regards,
>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>
>>>>>>>>> On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold) <withblessi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sochenda,
>>>>>>>>>> Please see my inline reply below.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much for your help. The reason my output
>>>>>>>>>>> file is empty is that I assigned my personal IDs to the
>>>>>>>>>>> glyphs, isn't it?
>>>>>>>>>>>
>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>> Please see the attached image. Should the glyph in the red
>>>>>>>>>>> box be assigned to a single Unicode character, or separated
>>>>>>>>>>> as in the image? This glyph is composed of two other
>>>>>>>>>>> glyphs: one can be represented by a Unicode character, and
>>>>>>>>>>> the other is part of a vowel.
>>>>>>>>>>>
>>>>>>>>>>> Dear Sriranga,
>>>>>>>>>>>
>>>>>>>>>>> Do the first several lines in your unicharset file
>>>>>>>>>>> represent characters, or are they just Unicode code points
>>>>>>>>>>> that do not represent any character?
>>>>>>>>>>> *These lines (viz. 0ccb 8, 0cd5 8, 20c88, 30ce0) are
>>>>>>>>>>> Unicode numbers, shown to you in place of the Kannada
>>>>>>>>>>> characters. Usually I use the characters (script) rather
>>>>>>>>>>> than Unicode numbers for training. I am using Tesseract
>>>>>>>>>>> 3.01 alpha (r-529).*
>>>>>>>>>>> Khmer font is also attached. Thanks, but I am unable to
>>>>>>>>>>> type with it. However, it appeared in CharacterMap.
>>>>>>>>>>>
>>>>>>>>>> On receipt of your alphabet list, I shall generate the data
>>>>>>>>>> files and forward them to you.
>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Sochenda
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <daemons2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>>>
>>>>>>>>>>>> In addition to what Sriranga said, I'd remind you that
>>>>>>>>>>>> you should do a lot of manual work:
>>>>>>>>>>>>
>>>>>>>>>>>> In pyTesseractTrainer, check that no bounding boxes (BBs)
>>>>>>>>>>>> intersect glyphs; if one does, correct its BB coordinates
>>>>>>>>>>>> manually.
>>>>>>>>>>>>
>>>>>>>>>>>> In cases of BB overlap you should space out the
>>>>>>>>>>>> participating glyphs in the training image (see the
>>>>>>>>>>>> attached picture for examples).
>>>>>>>>>>>>
>>>>>>>>>>>> You should use manual spacing if the participating glyphs
>>>>>>>>>>>> are dependent characters (in your language, vowels) and
>>>>>>>>>>>> the number of possible combinations is practically
>>>>>>>>>>>> uncountable. Then you would assign every glyph its own
>>>>>>>>>>>> code. Tess would consider these glyphs as separate
>>>>>>>>>>>> characters, and you should post-process the resulting
>>>>>>>>>>>> code sequence to obtain a well-formed dependent Unicode
>>>>>>>>>>>> pair (or triplet).
>>>>>>>>>>>>
>>>>>>>>>>>> If there can be only a few such combinations, you can
>>>>>>>>>>>> merge these BBs into one to encompass all the required
>>>>>>>>>>>> glyphs and assign a single code to the entire glyph
>>>>>>>>>>>> combination. Then during post-processing you'll need to
>>>>>>>>>>>> replace this single code with a predefined dependent
>>>>>>>>>>>> Unicode pair.
>>>>>>>>>>>>
>>>>>>>>>>>> Hope I've managed to express myself clearly.
>>>>>>>>>>>>
>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> You received this message because you are subscribed to
>>>>>>>>>>>> the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>> tesseract-ocr@googlegroups.com.
>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
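
A minimal Python sketch of two steps from Dmitry's advice in the thread: flagging bounding boxes that overlap in a box file, and replacing a single trained code with a predefined Unicode pair during post-processing. It assumes box-file lines of the form "symbol left bottom right top page". The function names, the private-use code U+E000, and the mapping to a Khmer consonant + vowel sign are hypothetical illustrations, not values taken from the thread. It only checks boxes against each other; checking boxes against the glyphs themselves would need the training image.

```python
# Sketch only. PLACEHOLDER_TO_UNICODE and the private-use code U+E000 are
# hypothetical; the thread does not say which codes were actually trained.


def parse_box_line(line):
    """Parse a box-file line of the form 'symbol left bottom right top [page]'."""
    parts = line.split()
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return parts[0], (left, bottom, right, top)


def boxes_overlap(a, b):
    """Return True if two (left, bottom, right, top) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_overlaps(box_lines):
    """Return index pairs (i, j), i < j, of non-empty lines whose boxes overlap."""
    boxes = [parse_box_line(l)[1] for l in box_lines if l.strip()]
    return [(i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if boxes_overlap(boxes[i], boxes[j])]


# One trained code standing for a merged consonant + dependent vowel box.
# Here: KHMER LETTER KA (U+1780) followed by KHMER VOWEL SIGN AA (U+17B6).
PLACEHOLDER_TO_UNICODE = {"\ue000": "\u1780\u17b6"}


def postprocess(text, mapping=PLACEHOLDER_TO_UNICODE):
    """Replace each single placeholder code with its predefined Unicode sequence."""
    for code, sequence in mapping.items():
        text = text.replace(code, sequence)
    return text


if __name__ == "__main__":
    sample = ["a 10 10 20 20 0", "b 18 12 30 22 0", "c 40 10 50 20 0"]
    print(find_overlaps(sample))
    print(postprocess("x\ue000y") == "x\u1780\u17b6y")
```

Any overlapping pairs reported by find_overlaps would then be fixed by hand, either by spacing out the glyphs in the training image or by merging the boxes and adding the merged code to the mapping.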