Re: [tesseract-ocr] Re: Regarding Tesseract OCR engine for recognizing Tamil Fonts

Shree Devi Kumar Wed, 27 Aug 2014 10:09:44 -0700

Hi Sibi,

Please see http://vietocr.sourceforge.net/training.html
for details about jTessBoxEditor. It requires Java Runtime Environment 6.0 or
later <http://www.oracle.com/technetwork/java/javase/downloads/index.html>.


I have used it only on Windows, but I expect it will run under Ubuntu if you
have a Java environment installed. Please check with Quan about it.

For a sample of Tamil training source files, please download
http://sourceforge.net/projects/tesseracthindi/files/Tamil%20Training%20Files/tam.zip/download


Note that it is a large file (37 MB) as it has the sample tif/box pairs.
You can use the files as a starting point for Tamil training.
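
For anyone new to the tif/box pairs mentioned above, here is a minimal Python
sketch for reading a box file. It assumes the Tesseract 3.x box layout of one
symbol per line followed by left, bottom, right, top and page number,
separated by spaces. The Box type, the function names and the sample
coordinates are made up for illustration and are not taken from tam.zip.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One line of a box file (hypothetical helper type)."""
    text: str    # the symbol, which may be several code points (e.g. consonant + vowel sign)
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the symbol field is kept whole.
    text, left, bottom, right, top, page = line.strip().rsplit(" ", 5)
    return Box(text, int(left), int(bottom), int(right), int(top), int(page))


def parse_box_text(content: str) -> List[Box]:
    # Skip empty lines; every other line is expected to be one box.
    return [parse_box_line(line) for line in content.splitlines() if line.strip()]


# Invented sample data, only to show the shape of the format.
SAMPLE = "ப 10 20 40 60 0\nபொ 45 20 95 60 0\n"

if __name__ == "__main__":
    for box in parse_box_text(SAMPLE):
        print(box.text, box.right - box.left, box.top - box.bottom)
```

Reading the boxes this way makes it easy to check, before training, that every
symbol in a box file is one you expect to see in the unicharset.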

I have not used the Tamil training data provided with Tesseract and cannot
comment on it. It may well be better than the sample files I have provided,
because I only wanted to give you a framework for training with
jTessBoxEditor that you can then improve.

BTW, I noticed that new language-related files have been added to the
repository, and you can get the Tamil training text used by Google at

https://code.google.com/p/tesseract-ocr/source/browse/tam/tam.training_text?spec=svn.langdata.9204c02c18daedaedc8aeaab1c1dd99e544cc932&repo=langdata&r=9204c02c18daedaedc8aeaab1c1dd99e544cc932

All training-related files for Tamil are at

https://code.google.com/p/tesseract-ocr/source/browse/tam/?repo=langdata&r=9204c02c18daedaedc8aeaab1c1dd99e544cc932

Hope this helps you.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Wed, Aug 27, 2014 at 7:20 PM, sibi kanagaraj <civil.si...@gmail.com>
wrote:

> Hello Shree,
>
> Thank you for the input.
>
> I have some doubts regarding it.
>
> 1. Is it possible to use jTessBoxEditor on a GNU/Linux platform (Ubuntu)?
> 2. How is it different from, or similar to, the training data which has been
> provided along with Tesseract-OCR?
>
> -Sibi
>
>
>
> On Thursday, August 7, 2014 5:44:23 PM UTC+5:30, Shree wrote:
>
>> Hello Sibi,
>>
>> Please see
>> https://sourceforge.net/projects/tesseracthindi/files/Tamil%20Training%20Files/?
>>
>> It has training files which can be used as a starting point for Tamil
>> script training for Tesseract 3.02/3.03.
>> I am only familiar with the basics of the Tamil script, hence these will
>> require changes and updates.
>> tam.zip is a zip file with the traineddata, tif and box pairs, and other
>> required files.
>> dir.txt lists all the files available in the zip. These were produced
>> using Quan Nguyen's jTessBoxEditor and VietOCR.
>>
>> jTessBoxEditor v1.0
>>
>>
>>    - Integrate support for full automation of Tesseract training
>>    - Bundle Tesseract Windows training executables (r866), English data,
>>    and config files
>>
>> VietOCR v4.0 Beta
>>
>>
>>    - Upgrade to Tesseract 3.03 RC (r1051)
>>
>> THANK YOU, QUAN, for the software and your prompt response to my queries.
>>
>> Unicharambigs will need to be modified, or postprocessing will be
>> required, for the vowel signs which are written both before and after the
>> consonant, i.e.
>>
>>    - பொ 0BCA TAMIL VOWEL SIGN O (combined with pa (ப))
>>    - போ 0BCB TAMIL VOWEL SIGN OO (combined with pa (ப))
>>    - பௌ 0BCC TAMIL VOWEL SIGN AU (combined with pa (ப))
>>
>> Changes will also be required for distinguishing between ள 0BB3 TAMIL
>> LETTER LLA and the last part of பௌ 0BCC TAMIL VOWEL SIGN AU (combined
>> with pa (ப)).
>>
>> The files include tam.traineddata, which can be used with VietOCR to
>> test OCR of Tamil texts.
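
A minimal Python sketch of the postprocessing idea for the two-part vowel
signs, under one assumption that is not stated in the thread: that the OCR
output contains the two halves as separate code points in logical order
(consonant, left part, right part). In that case Unicode NFC normalization
folds each pair into the single code point, because 0BCA, 0BCB and 0BCC have
canonical decompositions. The function name is hypothetical. The sketch does
not resolve the ள (0BB3) versus AU confusion, which needs context that
normalization alone cannot supply.

```python
import unicodedata

PA = "\u0baa"  # TAMIL LETTER PA

# Two-part vowel signs and their canonical decompositions.
TWO_PART_SIGNS = {
    "\u0bca": "\u0bc6\u0bbe",  # VOWEL SIGN O  = VOWEL SIGN E  + VOWEL SIGN AA
    "\u0bcb": "\u0bc7\u0bbe",  # VOWEL SIGN OO = VOWEL SIGN EE + VOWEL SIGN AA
    "\u0bcc": "\u0bc6\u0bd7",  # VOWEL SIGN AU = VOWEL SIGN E  + AU LENGTH MARK
}


def compose_vowel_signs(text: str) -> str:
    """Fold split vowel-sign pairs into single code points using NFC."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for composed, parts in TWO_PART_SIGNS.items():
        fixed = compose_vowel_signs(PA + parts)
        print("U+%04X" % ord(composed), fixed == PA + composed)
```

If the engine instead emits the left part before the consonant (visual
order), a reordering step would be needed first; that case is not covered
here.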
>>
>>
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Mon, Jul 21, 2014 at 2:45 PM, sibi kanagaraj <civil...@gmail.com>
>> wrote:
>>
>>>
>>> Hi Shree,
>>>
>>> Thank you for the input.
>>>
>>> I have started testing with a .png file for Tamil. I used an image from
>>> a Tamil textbook.
>>>
>>> Though an entire page was given as input, I would like to paste the *most
>>> accurate* result which I got. I am sure that the deviation is quite
>>> large.
>>>
>>> My problem is that I don't know where to start reading and working on
>>> the code. I read the FAQ and suddenly jump to the modules, then from there
>>> to the 2.0 or 3.0 confusion, and it keeps growing.
>>>
>>> For the given input shown above
>>>
>>> The output is pasted here
>>>
>>> http://pastebin.com/PMRz204y
>>>
>>> -Sibi
>>>
>>>
>>>
>>> On Monday, July 21, 2014 12:27:18 PM UTC+5:30, shree wrote:
>>>
>>>> Sibi,
>>>>
>>>> I would suggest that you try Tesseract using a GUI frontend such as
>>>> VietOCR with the Tamil training data provided by Google (the 3.02 version
>>>> is the latest, I think) to get an idea of how well it recognizes Tamil.
>>>>
>>>> You can create your own training data using jTessBoxEditor.
>>>>
>>>> More training tools and traineddata for other languages may be
>>>> forthcoming during the next few months, but no one knows when...
>>>>
>>>> Shree
>>>>
>>>>
>>>>
>>>> On Sun, Jul 20, 2014 at 10:07 PM, sibi kanagaraj <civil...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Sorry for my delayed reply.
>>>>>
>>>>> Thank you Paul and Nick for your inputs.
>>>>>
>>>>> @ Paul,
>>>>>
>>>>> //imagery for doing training is not available. So basically you would
>>>>> have to start all over.//
>>>>>
>>>>> Starting all over in what sense? I have described the efforts I have
>>>>> made so far in my mail. Does it mean that the training process has to be
>>>>> started from the beginning?
>>>>>
>>>>> @ Nick White
>>>>>
>>>>> //Can you give us some clue as to what you think could be improved
>>>>> about the current Tamil recognition? Changes of configuration  variables,
>>>>> or ambiguity rules (the unicharambigs file), don't need
>>>>> access to the training images. //
>>>>>
>>>>> So far I have only gone through the documents and have not yet got my
>>>>> hands on the code or the actual working of the engine. I am in the
>>>>> initial stages of analysis. I have a good amount of time (around 9
>>>>> months) to work on the project and would love to contribute to a project
>>>>> under the Apache License and also in my mother tongue.
>>>>>
>>>>> “The new page layout analysis for Tesseract was designed from the
>>>>> beginning to be language-independent, but the rest of the engine was
>>>>> developed for English, without a great deal of thought as to how it might
>>>>> work for other languages.” [1] The training document for Tesseract
>>>>> also notes that “.. the Tesseract was originally designed to recognize
>>>>> English text only. Efforts have been made to modify the engine and its
>>>>> training system to make them able to deal with other languages and UTF-8
>>>>> characters. Tesseract 3.0 can handle any Unicode characters (coded with
>>>>> UTF-8), but there are limits as to the range of languages that it will be
>>>>> successful with..” and “..Tesseract needs to know about different shapes
>>>>> of the same character by having different fonts separated explicitly. ..”
>>>>> and “..Any language that has different punctuation and numbers is going to
>>>>> be disadvantaged by some of the hard-coded algorithms that assume ASCII
>>>>> punctuation and digits...” [2]
>>>>>
>>>>> [1] Ray Smith, Daria Antonova, Dar-Shyang Lee. Adapting the Tesseract
>>>>> open source OCR engine for multilingual OCR. ACM, 2009.
>>>>> [2] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>>>
>>>>> Tamil has almost all of the above-mentioned issues.
>>>>>
>>>>> I am wondering where to start learning the code, where to test it,
>>>>> and so on.
>>>>>
>>>>> -Sibi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wednesday, July 16, 2014 1:38:17 AM UTC+5:30, Nick White wrote:
>>>>>>
>>>>>> On Mon, Jul 14, 2014 at 11:36:46AM -0700, Paul wrote:
>>>>>> > Am Montag, 14. Juli 2014 10:07:59 UTC+2 schrieb sibi kanagaraj:
>>>>>> >     But, I feel that Tamil Training is not sufficient and it could be
>>>>>> >     streamlined. Hence I went to see if there are sufficient training
>>>>>> >     documents for Tamil. This search landed me on this page. And
>>>>>> >     subsequently I found "Things I would NOT recommend working on" here.
>>>>>> >
>>>>>> >     I am a little bit stuck here. I wanted to do this project as part of
>>>>>> >     my Masters Degree. Isn't Tamil Training an independent module that
>>>>>> >     could be worked upon?
>>>>>> >
>>>>>> > I'm not sure what's the case for Tamil, but in general the imagery for
>>>>>> > doing training is not available. So basically you would have to start
>>>>>> > all over.
>>>>>>
>>>>>> Yes, that is the case, I'm afraid. There is a project that was
>>>>>> hoping to create improved trainings for South Asian languages, but
>>>>>> it hasn't been updated for quite a few years. See
>>>>>> http://code.google.com/p/parichit/
>>>>>>
>>>>>> Can you give us some clue as to what you think could be improved
>>>>>> about the current Tamil recognition? Changes of configuration
>>>>>> variables, or ambiguity rules (the unicharambigs file), don't need
>>>>>> access to the training images.
>>>>>>
>>>>>> Oh, by the way, the "Things I would NOT recommend working on" is a
>>>>>> very old page (from 2010); I wouldn't take it too seriously...
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>  --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/d16e9c59-0802-4da0-add7-fb310da00479%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>
