Fwd: Tesseract Training

Sriranga(78yrsold) Fri, 21 Jan 2011 23:48:51 -0800

---------- Forwarded message ----------
From: Sriranga(78yrsold) <withblessi...@gmail.com>
Date: Fri, Jan 21, 2011 at 12:33 PM
Subject: Re: Tesseract Training
To: KHEM Sochenda <khemsoche...@gmail.com>



Chenda,
It is better to type the characters themselves (in your language's script) in
the box file rather than their codes, because the characters will then appear
in the unicharset file. I don't know whether your keyboard can type your
language, but if it can, it is better to type them.
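
If a box file already contains hexadecimal codes instead of characters, a small script could convert them in one pass. The following is only a rough sketch under stated assumptions: each box line is taken to have the form "glyph left bottom right top [page]", and any first field consisting of 4 to 6 hex digits is treated as a Unicode code point. The function names are made up for illustration and are not part of Tesseract.

```python
import re

# Assumption: a first field made of 4 to 6 hex digits is a code point
# (for example "1780"), not literal text to be recognized.
HEX_CODE = re.compile(r"^[0-9A-Fa-f]{4,6}$")


def decode_glyph(token):
    """Return the character for a hex code point token, else the token itself."""
    if HEX_CODE.match(token):
        value = int(token, 16)
        # Skip values outside Unicode and the surrogate range.
        if value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF):
            return chr(value)
    return token


def convert_box_line(line):
    """Replace a hex code in the first field of one box-file line."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) < 5:
        return line.rstrip("\n")  # not a box entry; leave it unchanged
    parts[0] = decode_glyph(parts[0])
    return " ".join(parts)


def convert_box_text(text):
    """Convert every line of a box file given as a single string."""
    return "\n".join(convert_box_line(line) for line in text.splitlines())
```

To use it, read the box file as UTF-8, pass its text to convert_box_text, and write the result back as UTF-8. Check the output by eye first, since a real glyph label that happens to look like a hex number would also be converted.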


On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold) <
withblessi...@gmail.com> wrote:

> Chenda,
> I have edited the box file by guesswork, using another tool, owler.exe
> (which is for English only), attached herewith. The advantage of the attached
> owler.exe is that you can type a character or its hexadecimal code by
> pressing Tab. A consonant or an independent vowel may have a *single box*,
> but a consonant/independent vowel + *dependent vowel* must also have a
> single box. (The said owler tool is not suitable for Kannada, and as such I
> am not using it.)
> If the same tif file that was used for training is used for recognition, the
> output should naturally be displayed correctly. If a tif other than the one
> used for training is used, the output will naturally have misspellings,
> which can be corrected by post-processor software. The same problem occurred
> for Kannada also. I hope you will succeed in generating the trained data
> file, since there is no script more complex than Kannada.
> After receiving the corrected box file, I shall generate the trained data
> file.
>
> With Best Wishes,
> -sriranga(78yrs)
>
>
>
> On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda <khemsoche...@gmail.com> wrote:
>
>> Dear Dmitry and Sriranga,
>>
>> Here are my training results. When I ran recognition on the same image that
>> was used for training, the result was perfect. When I tried the attached
>> test image, there seemed to be problems recognizing the characters.
>>
>> Please tell me your thoughts about this.
>>
>> Best Regards,
>>
>> Sochenda
>>
>>
>> On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda
>> <khemsoche...@gmail.com> wrote:
>>
>>>
>>> Dear Sriranga,
>>>
>>> Here is my training box file. Editing a box file is really tedious. I just
>>> found some glyphs that I haven't assigned codes to yet, but it is difficult
>>> to find them either in the box editor you gave me or in
>>> pytesseracttrainer.py, as it is too slow.
>>>
>>> Best Regards,
>>>
>>> Sochenda
>>>
>>> On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <
>>> withblessi...@gmail.com> wrote:
>>>
>>>> Box file for editing.
>>>>
>>>>
>>>>
>>>> On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda
>>>> <khemsoche...@gmail.com> wrote:
>>>>
>>>>> Dear Dmitry and Sriranga,
>>>>>
>>>>> But, Sriranga, I guess your computer cannot render the Khmer language
>>>>> well. I will send you an image instead, OK?
>>>>>
>>>>> Best Regards,
>>>>> Sochenda
>>>>>
>>>>>
>>>>> On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <
>>>>> withblessi...@gmail.com> wrote:
>>>>>
>>>>>> Attached is a zip file containing the owler exe file. Before unzipping,
>>>>>> please delete the word {"OM"} first and then unzip.
>>>>>> With the help of owler, edit the box file according to your requirements.
>>>>>> After editing the box file, please forward it to me
>>>>>> for generating the traineddata file, or, if you are able to
>>>>>> generate the traineddata file, you can do it yourself - no problem.
>>>>>> With best of luck,
>>>>>> -sriranga(78yrs)
>>>>>> Dear Dmitry,
>>>>>> Sorry, I could not post to the forum because of the attached files. Hence
>>>>>> I am sending a copy to you.
>>>>>>
>>>>>> On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <
>>>>>> withblessi...@gmail.com> wrote:
>>>>>>
>>>>>>> Sochenda,
>>>>>>> Please find attached the box file with its khtext.png file for editing.
>>>>>>> I am sending khtext.tif and the owler tool to you separately for your
>>>>>>> editing purpose, since I don't know the Khmer language and am unable to
>>>>>>> type it on the keyboard. After editing the box file, please return it to
>>>>>>> me for further processing.
>>>>>>>
>>>>>>> With best of Luck,
>>>>>>> -sriranga(78yrs)
>>>>>>>
>>>>>>> 2011/1/20 KHEM Sochenda <khemsoche...@gmail.com>
>>>>>>>
>>>>>>>
>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>
>>>>>>>> I am so confused now. :(
>>>>>>>>
>>>>>>>> Maybe I should apply for internship with tesseract, but I am so
>>>>>>>> engaged with my project here.
>>>>>>>>
>>>>>>>> Please find attached KHtext in Unicode as a training sample.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>>
>>>>>>>> Sochenda
>>>>>>>>
>>>>>>>> 2011/1/19 Sriranga(78yrsold) <withblessi...@gmail.com>
>>>>>>>>
>>>>>>>>> Sochenda,
>>>>>>>>> The output of the lines viz. 0ccb 8, 0cd5 8, 20c88 appeared in
>>>>>>>>> vowel1.txt. So we have to convert the Unicode numbers to Kannada
>>>>>>>>> characters (script) with the help of a post-processor.
>>>>>>>>> -Regards,
>>>>>>>>> -sriranga(78yrs)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold) <
>>>>>>>>> withblessi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sochenda,
>>>>>>>>>> Please see the inline replies below.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <
>>>>>>>>>> khemsoche...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear Dmitry and Sriranga,
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much for your help. The reason my output file is
>>>>>>>>>>> empty is that I assigned my own personal IDs to the glyphs, isn't it?
>>>>>>>>>>>
>>>>>>>>>>> Dear Dmitry,
>>>>>>>>>>> Please see the attached image. Should the glyph in the red box be
>>>>>>>>>>> assigned to a single Unicode character, or separated as in the image?
>>>>>>>>>>> This glyph is composed of two other glyphs: one can be represented by
>>>>>>>>>>> a Unicode character, and the other is part of a vowel.
>>>>>>>>>>>
>>>>>>>>>>> Dear Sriranga,
>>>>>>>>>>>
>>>>>>>>>>> Do the first several lines in your unicharset file represent
>>>>>>>>>>> characters, or are they just Unicode code points that represent no
>>>>>>>>>>> character?
>>>>>>>>>>> *These lines, viz. 0ccb 8, 0cd5 8, 20c88, 30ce0, are Unicode
>>>>>>>>>>> numbers given in place of Kannada characters, just to show you.
>>>>>>>>>>> Usually I use characters (script) instead of Unicode numbers for
>>>>>>>>>>> training purposes. I am using tesseract 3.01 alpha (r-529).*
>>>>>>>>>>> Khmer font is also attached. Thanks but unable to type. However
>>>>>>>>>>> it appeared in CharacterMap.
>>>>>>>>>>>
>>>>>>>>>>   On receipt of your alphabet list I shall generate the data files
>>>>>>>>>> and forward them to you.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Sochenda
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 18, 2011 at 8:27 PM, Dmitry Silaev <
>>>>>>>>>>> daemons2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear Sochenda,
>>>>>>>>>>>>
>>>>>>>>>>>> In addition to what Sriranga said, I'd remind you that you will
>>>>>>>>>>>> need to do a lot of manual work:
>>>>>>>>>>>>
>>>>>>>>>>>> In pyTesseractTrainer check that no bounding boxes (BBs) intersect
>>>>>>>>>>>> glyphs; if any do, correct their BB coordinates manually.
>>>>>>>>>>>>
>>>>>>>>>>>> In cases of BB overlap you should space out participating glyphs
>>>>>>>>>>>> in the training image (see the attached picture for examples).
>>>>>>>>>>>>
>>>>>>>>>>>> You should use manual spacing if participating glyphs are
>>>>>>>>>>>> dependent characters (in your language - vowels) and the number of
>>>>>>>>>>>> possible combinations is practically uncountable. Then you would
>>>>>>>>>>>> assign every glyph its own code. Tess would consider these glyphs
>>>>>>>>>>>> as separate characters and you should post-process the resulting
>>>>>>>>>>>> code sequence to obtain a well-formed dependent Unicode pair (or
>>>>>>>>>>>> triplet).
>>>>>>>>>>>>
>>>>>>>>>>>> If there can be only a few such combinations - you can merge these
>>>>>>>>>>>> BBs into one to encompass all the required glyphs and assign a
>>>>>>>>>>>> single code to the entire glyph combination. Then during the
>>>>>>>>>>>> post-processing you'll need to replace this single code with a
>>>>>>>>>>>> predefined dependent Unicode pair.
>>>>>>>>>>>>
>>>>>>>>>>>> Hope I've managed to express myself clearly.
>>>>>>>>>>>>
>>>>>>>>>>>> Warm regards,
>>>>>>>>>>>> Dmitry Silaev
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  --
>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>> tesseract-ocr@googlegroups.com.
>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>> tesseract-ocr+unsubscr...@googlegroups.com<tesseract-ocr%2bunsubscr...@googlegroups.com>
>>>>>>>>>>>> .
>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>>>>>
>>>>>>>>>>>

