Re: [tesseract-ocr] Re: Regarding Tesseract OCR engine for recognizing Tamil Fonts

Shree Devi Kumar Sun, 20 Jul 2014 23:58:18 -0700

Sibi,

I would suggest that you try Tesseract through a GUI front end such as
VietOCR, with the Tamil training data provided by Google (3.02 is the
latest version, I think), to get an idea of how well it recognizes Tamil.
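
For a quick check without a GUI, the command-line tool can also be driven
from a small script. This is only a minimal sketch, assuming the tesseract
binary is on your PATH and the Tamil traineddata (language code "tam") is
installed in its tessdata directory. The helper names and file names below
are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="tam"):
    """Return the argv list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def recognize(image_path, output_base, lang="tam"):
    """Run tesseract if it is installed; return the recognized text, else None."""
    if shutil.which("tesseract") is None:
        return None  # binary not found on PATH, so there is nothing to run
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # tesseract writes its plain-text result to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()


# Hypothetical usage, with a scanned Tamil page saved as sample.png:
#     text = recognize("sample.png", "sample_out")
```

Comparing the returned text with the scanned page by eye should already show
which characters are commonly confused.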


You can create your own training data using jTessBoxEditor.

More training tools and traineddata for other languages may be forthcoming
during the next few months, but no one knows when...

Shree



On Sun, Jul 20, 2014 at 10:07 PM, sibi kanagaraj <[email protected]>
wrote:

> Hi,
>
> Sorry for my delayed reply.
>
> Thank you, Paul and Nick, for your inputs.
>
> @Paul,
>
> //imagery for doing training is not available. So basically you would have
> to start all over.//
>
> Starting all over in what sense? I described the efforts I have made so far
> in my earlier mail. Do you mean that the training process has to be started
> from the beginning?
>
> @Nick White,
>
> //Can you give us some clue as to what you think could be improved about
> the current Tamil recognition? Changes of configuration variables, or
> ambiguity rules (the unicharambigs file), don't need
> access to the training images.//
>
> So far I have only gone through the documents and have not yet got my hands
> on the code or the actual working of the engine. I am in the initial stages
> of analysis. I have a good amount of time (around 9 months) to work on
> the project, and I would love to contribute to a project under the Apache
> License that also supports my mother tongue.
>
> “The new page layout analysis for Tesseract was designed from the
> beginning to be language-independent, but the rest of the engine was
> developed for English, without a great deal of thought as to how it might
> work for other languages.” [1] The training document for Tesseract
> likewise notes that “... Tesseract was originally designed to recognize
> English text only. Efforts have been made to modify the engine and its
> training system to make them able to deal with other languages and UTF-8
> characters. Tesseract 3.0 can handle any Unicode characters (coded with
> UTF-8), but there are limits as to the range of languages that it will be
> successful with ...”, “... Tesseract needs to know about different shapes
> of the same character by having different fonts separated explicitly ...”
> and “... Any language that has different punctuation and numbers is going to
> be disadvantaged by some of the hard-coded algorithms that assume ASCII
> punctuation and digits ...” [2]
>
> [1] Ray Smith, Daria Antonova, Dar-Shyang Lee. Adapting the Tesseract
> open source OCR engine for multilingual OCR. ACM, 2009.
> [2] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
> Tamil has almost all of the issues mentioned above.
>
> I am wondering where to start learning the code, where to test it, and
> so on.
>
> -Sibi
>
> On Wednesday, July 16, 2014 1:38:17 AM UTC+5:30, Nick White wrote:
>>
>> On Mon, Jul 14, 2014 at 11:36:46AM -0700, Paul wrote:
>> > On Monday, 14 July 2014 10:07:59 UTC+2, sibi kanagaraj wrote:
>> >     But I feel that the Tamil training is not sufficient and that it
>> >     could be streamlined. Hence I went to see if there are sufficient
>> >     training documents for Tamil. This search landed me on this page. And
>> >     subsequently I found "Things I would NOT recommend working on" here.
>> >
>> >     I am a little bit stuck here. I wanted to do this project as part of
>> >     my Master's degree. Isn't the Tamil training an independent module
>> >     that could be worked on?
>> >
>> > I'm not sure what's the case for Tamil, but in general the imagery for
>> > doing training is not available. So basically you would have to start
>> > all over.
>>
>> Yes, that is the case, I'm afraid. There is a project that was
>> hoping to create improved trainings for South Asian languages, but
>> it hasn't been updated for quite a few years. See
>> http://code.google.com/p/parichit/
>>
>> Can you give us some clue as to what you think could be improved
>> about the current Tamil recognition? Changes of configuration
>> variables, or ambiguity rules (the unicharambigs file), don't need
>> access to the training images.
>>
>> Oh, by the way, the "Things I would NOT recommend working on" is a
>> very old page (from 2010); I wouldn't take it too seriously...
>>
>> Nick
>>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/d16e9c59-0802-4da0-add7-fb310da00479%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/d16e9c59-0802-4da0-add7-fb310da00479%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

