Re: [tesseract-ocr] Re: Regarding Tesseract OCR engine for recognizing Tamil Fonts

Shree Devi Kumar Thu, 07 Aug 2014 05:14:39 -0700

Hello Sibi,

Please see
https://sourceforge.net/projects/tesseracthindi/files/Tamil%20Training%20Files/


It has training files that can be used as a starting point for Tamil script
training for Tesseract 3.02/3.03.
I am only familiar with the basics of Tamil script, so these files will
require changes and updates.
tam.zip is a zip file with the traineddata, the tif and box pairs, and other
required files.
dir.txt lists all the files available in the zip. These were produced
using Quan Nguyen's jTessBoxEditor and VietOCR.

jTessBoxEditor v1.0


   - Integrate support for full automation of Tesseract training
   - Bundle Tesseract Windows training executables (r866), English data,
   and config files

VietOCR v4.0 Beta


   - Upgrade to Tesseract 3.03 RC (r1051)

THANK YOU, QUAN, for the software and your prompt response to my queries.
Unicharambigs will need to be modified, or postprocessing will be required,
for the vowel signs that have parts on both sides of the consonant, i.e.

   - பொ 0BCA TAMIL VOWEL SIGN O (combined with pa (ப))
   - போ 0BCB TAMIL VOWEL SIGN OO (combined with pa (ப))
   - பௌ 0BCC TAMIL VOWEL SIGN AU (combined with pa (ப))

Changes will also be required for distinguishing between ள 0BB3 TAMIL LETTER
LLA and the last part of பௌ 0BCC TAMIL VOWEL SIGN AU (combined with pa (ப)).

The files include tam.traineddata, which can be used with VietOCR to test
OCR of Tamil texts.
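
To illustrate the postprocessing idea for the two-part vowel signs, here is
a minimal Python sketch. It is not part of the training files. It assumes
that the OCR output lists glyphs in visual (left-to-right) order, so that a
left-side vowel sign comes before its consonant. The helper name
visual_to_logical is hypothetical. The sketch relies only on the canonical
decompositions that Unicode defines for 0BCA, 0BCB and 0BCC.

```python
import re
import unicodedata

# Tamil consonants KA..HA (U+0B95..U+0BB9).
CONSONANT = "[\u0B95-\u0BB9]"
# Vowel signs E, EE, AI (U+0BC6..U+0BC8), which are drawn left of the consonant.
LEFT_SIGN = "[\u0BC6-\u0BC8]"


def visual_to_logical(text):
    """Hypothetical helper: move left-side vowel signs after their consonant,
    then compose two-part vowel signs with NFC."""
    # Assumption: every left-side sign in `text` precedes its consonant.
    # Text that is already in logical order must not be passed through this.
    reordered = re.sub(f"({LEFT_SIGN})({CONSONANT})", r"\2\1", text)
    # NFC joins E + AA into O, EE + AA into OO, and E + AU LENGTH MARK into AU.
    return unicodedata.normalize("NFC", reordered)


# Show the canonical decomposition of each two-part sign.
for sign in "\u0BCA\u0BCB\u0BCC":
    parts = unicodedata.normalize("NFD", sign)
    print(f"U+{ord(sign):04X}", "=", " + ".join(f"U+{ord(p):04X}" for p in parts))

# Visual order (E sign, PA, AA sign) compared with logical PA + VOWEL SIGN O.
print(visual_to_logical("\u0BC6\u0BAA\u0BBE") == "\u0BAA\u0BCA")
```

This sketch does not address the ambiguity between ள (0BB3) and the right
part of the AU sign (0BD7 TAMIL AU LENGTH MARK). Resolving that would need
context or an unicharambigs rule.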



Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Mon, Jul 21, 2014 at 2:45 PM, sibi kanagaraj <civil.si...@gmail.com>
wrote:

>
> Hi Shree,
>
> Thank you for the input.
>
> I have started testing the .png file for Tamil. I used an image from a
> Tamil textbook.
>
> Though an entire page was given as input, I would like to paste the *most*
> accurate result that I got. I am sure that the deviation is quite
> large.
>
> The problem is that I don't know where to start reading and working on
> the code. I read the FAQ and suddenly jump to modules, then from there to
> the 2.0 or 3.0 confusion, and it keeps growing.
>
> For the given input shown above
>
> The output is pasted here
>
> http://pastebin.com/PMRz204y
>
> -Sibi
>
>
>
> On Monday, July 21, 2014 12:27:18 PM UTC+5:30, shree wrote:
>
>> Sibi,
>>
>> I would suggest that you try Tesseract by using a GUI frontend such as
>> VietOCR with the Tamil training data provided by Google (3.02 is the
>> latest version, I think) to get an idea of how well it recognizes Tamil.
>>
>> You can create your own training data using jTessBoxEditor.
>>
>> More training tools and traineddata for other languages may be forthcoming
>> during the next few months, but no one knows when...
>>
>> Shree
>>
>>
>>
>> On Sun, Jul 20, 2014 at 10:07 PM, sibi kanagaraj <civil...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry for my delayed reply.
>>>
>>> Thank you, Paul and Nick, for your inputs.
>>>
>>> @ Paul,
>>>
>>> //imagery for doing training is not available. So basically you would
>>> have to start all over.//
>>>
>>> Starting all over in what sense? I have described the efforts I have
>>> made so far in my mail. Does it mean that the training process has to be
>>> started from the beginning?
>>>
>>> @ Nick White
>>>
>>> //Can you give us some clue as to what you think could be improved
>>> about the current Tamil recognition? Changes of configuration variables,
>>> or ambiguity rules (the unicharambigs file), don't need access to the
>>> training images.//
>>>
>>> So far I have only gone through the documents and have not yet worked
>>> with the code or the actual operation of the engine. I am in the initial
>>> stages of analysis. I have a good amount of time (around 9 months) to work
>>> on the project and would love to contribute to a project under the Apache
>>> License and also in my mother tongue.
>>>
>>> “The new page layout analysis for Tesseract was designed from the
>>> beginning to be language-independent, but the rest of the engine was
>>> developed for English, without a great deal of thought as to how it might
>>> work for other languages.” [1] And the training document for Tesseract
>>> notes that “... Tesseract was originally designed to recognize
>>> English text only. Efforts have been made to modify the engine and its
>>> training system to make them able to deal with other languages and UTF-8
>>> characters. Tesseract 3.0 can handle any Unicode characters (coded with
>>> UTF-8), but there are limits as to the range of languages that it will be
>>> successful with ...” and “... Tesseract needs to know about different
>>> shapes of the same character by having different fonts separated
>>> explicitly ...” and “... Any language that has different punctuation and
>>> numbers is going to be disadvantaged by some of the hard-coded algorithms
>>> that assume ASCII punctuation and digits ...” [2]
>>>
>>> [1] Ray Smith, Daria Antonova, Dar-Shyang Lee. Adapting the Tesseract
>>> open source OCR engine for multilingual OCR. ACM, 2009.
>>> [2] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>
>>> Tamil has almost all of the above-mentioned issues.
>>>
>>> I am wondering where to start learning the code, where to test it, and
>>> so on.
>>>
>>> -Sibi
>>> -
>>>
>>>
>>>
>>>
>>> On Wednesday, July 16, 2014 1:38:17 AM UTC+5:30, Nick White wrote:
>>>>
>>>> On Mon, Jul 14, 2014 at 11:36:46AM -0700, Paul wrote:
>>>> > On Monday, 14 July 2014 10:07:59 UTC+2, sibi kanagaraj wrote:
>>>> >     But I feel that the Tamil training is not sufficient and that it
>>>> >     could be streamlined. Hence I went to see whether there are
>>>> >     sufficient training documents for Tamil. This search landed me on
>>>> >     this page, and subsequently I found "Things I would NOT recommend
>>>> >     working on" here.
>>>> >
>>>> >     I am a little bit stuck here. I wanted to do this project as part
>>>> >     of my Master's degree. Isn't Tamil training an independent module
>>>> >     that could be worked on?
>>>> >
>>>> > I'm not sure what's the case for Tamil, but in general the imagery for
>>>> > doing training is not available. So basically you would have to start
>>>> > all over.
>>>>
>>>> Yes, that is the case, I'm afraid. There is a project that was
>>>> hoping to create improved trainings for South Asian languages, but
>>>> it hasn't been updated for quite a few years. See
>>>> http://code.google.com/p/parichit/
>>>>
>>>> Can you give us some clue as to what you think could be improved
>>>> about the current Tamil recognition? Changes of configuration
>>>> variables, or ambiguity rules (the unicharambigs file), don't need
>>>> access to the training images.
>>>>
>>>> Oh, by the way, the "Things I would NOT recommend working on" is a
>>>> very old page (from 2010); I wouldn't take it too seriously...
>>>>
>>>> Nick
>>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>>
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/d16e9c59-0802-4da0-add7-fb310da00479%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

