Hi Sibi,

Please see http://vietocr.sourceforge.net/training.html for details about jTessBoxEditor. It requires Java Runtime Environment <http://www.oracle.com/technetwork/java/javase/downloads/index.html> 6.0 or later.
I have used it only on Windows, but I expect it will run under Ubuntu if you have a Java environment installed. Please check with Quan about it.

For sample Tamil training source files, please download
http://sourceforge.net/projects/tesseracthindi/files/Tamil%20Training%20Files/tam.zip/download
Note that it is a large file (37 MB) because it includes the sample tif/box pairs. You can use the files as a starting point for Tamil training.

I have not used the Tamil training data provided with Tesseract and cannot comment on it. It may well be better than my sample files, because I only wanted to give you a framework for training with jTessBoxEditor that you can then improve.

BTW, I noticed that new language-related files have been added to the repository. You can get the Tamil training text used by Google at
https://code.google.com/p/tesseract-ocr/source/browse/tam/tam.training_text?spec=svn.langdata.9204c02c18daedaedc8aeaab1c1dd99e544cc932&repo=langdata&r=9204c02c18daedaedc8aeaab1c1dd99e544cc932

All training-related files for Tamil are at
https://code.google.com/p/tesseract-ocr/source/browse/tam/?repo=langdata&r=9204c02c18daedaedc8aeaab1c1dd99e544cc932

Hope this helps you.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Aug 27, 2014 at 7:20 PM, sibi kanagaraj <civil.si...@gmail.com> wrote:

> Hello Shree,
>
> Thank you for the input.
>
> I have some doubts regarding it.
>
> 1. Is it possible to use jTessBoxEditor on a GNU/Linux platform (Ubuntu)?
> 2. How is it different from, or similar to, the training data provided
> along with Tesseract-OCR?
>
> -Sibi
>
> On Thursday, August 7, 2014 5:44:23 PM UTC+5:30, Shree wrote:
>
>> Hello Sibi,
>>
>> Please see https://sourceforge.net/projects/tesseracthindi/files/Tamil%20Training%20Files/?
>>
>> It has training files which can be used as a start for Tamil script
>> training for Tesseract 3.02/03.
>> I am only familiar with the basics of Tamil script, hence these will
>> require changes and updates.
>> tam.zip is a zip file with the traineddata, tif and box pairs and other
>> required files.
>> dir.txt lists all the files available in the zip. These were produced
>> using Quan Nguyen's jTessBoxEditor and VietOCR.
>>
>> jTessBoxEditor v1.0
>>
>> - Integrate support for full automation of Tesseract training
>> - Bundle Tesseract Windows training executables (r866), English data,
>>   and config files
>>
>> VietOCR v4.0 Beta
>>
>> - Upgrade to Tesseract 3.03 RC (r1051)
>>
>> THANK YOU, QUAN, for the software and your prompt response to my queries.
>>
>> Unicharambigs will need to be modified, or postprocessing will be
>> required, for the vowel signs which are written both before and after
>> the consonant, i.e.
>>
>> பொ 0BCA TAMIL VOWEL SIGN O (combined with pa (ப))
>> போ 0BCB TAMIL VOWEL SIGN OO (combined with pa (ப))
>> பௌ 0BCC TAMIL VOWEL SIGN AU (combined with pa (ப))
>>
>> Changes will also be required for distinguishing between
>> ள 0BB3 TAMIL LETTER LLA and the last part of
>> பௌ 0BCC TAMIL VOWEL SIGN AU (combined with pa (ப)).
>>
>> The files include tam.traineddata, which can be used with
>> VietOCR to test OCR of Tamil texts.
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Jul 21, 2014 at 2:45 PM, sibi kanagaraj <civil...@gmail.com>
>> wrote:
>>
>>> Hi Shree,
>>>
>>> Thank you for the input.
>>>
>>> I have started testing with a .png file for Tamil. I used an image from
>>> a Tamil textbook.
>>>
>>> Though an entire page was given as input, I would like to paste the *most*
>>> accurate result which I got. I am sure that the deviation is quite
>>> large.
>>>
>>> My problem is that I don't know where to start reading and
>>> working on the code. I read the FAQ and suddenly jump to modules, then from
>>> there to the 2.0 or 3.0 confusion, and it keeps growing.
>>>
>>> For the given input shown above
>>>
>>> The output is pasted here
>>>
>>> http://pastebin.com/PMRz204y
>>>
>>> -Sibi
>>>
>>> On Monday, July 21, 2014 12:27:18 PM UTC+5:30, shree wrote:
>>>
>>>> Sibi,
>>>>
>>>> I would suggest that you try Tesseract through a GUI frontend such as
>>>> VietOCR with the Tamil training data provided by Google (3.02 is the
>>>> latest version, I think) to get an idea of how well it recognizes Tamil.
>>>>
>>>> You can create your own training data using jTessBoxEditor.
>>>>
>>>> More training tools and traineddata for other languages may be
>>>> forthcoming during the next few months, but no one knows when...
>>>>
>>>> Shree
>>>>
>>>> On Sun, Jul 20, 2014 at 10:07 PM, sibi kanagaraj <civil...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Sorry for my delayed reply.
>>>>>
>>>>> Thank you, Paul and Nick, for your inputs.
>>>>>
>>>>> @Paul,
>>>>>
>>>>> //imagery for doing training is not available. So basically you would
>>>>> have to start all over.//
>>>>>
>>>>> Starting all over in what sense? I described the efforts I have made
>>>>> so far in my mail. Does it mean that the training process has to be
>>>>> started from the beginning?
>>>>>
>>>>> @Nick White
>>>>>
>>>>> //Can you give us some clue as to what you think could be improved
>>>>> about the current Tamil recognition? Changes of configuration variables,
>>>>> or ambiguity rules (the unicharambigs file), don't need
>>>>> access to the training images.//
>>>>>
>>>>> For now I have only gone through the documents and have not yet got my
>>>>> hands on the code or the actual working of the engine. I am in the
>>>>> initial stages of analysis. I have a good amount of time (around 9
>>>>> months) to work on the project and would love to contribute to a
>>>>> project under the Apache License and also in my mother tongue.
>>>>>
>>>>> “The new page layout analysis for Tesseract was designed from the
>>>>> beginning to be language-independent, but the rest of the engine was
>>>>> developed for English, without a great deal of thought as to how it might
>>>>> work for other languages.” [1]
>>>>>
>>>>> And the training document for Tesseract notes that “.. the Tesseract
>>>>> was originally designed to recognize English text only. Efforts have
>>>>> been made to modify the engine and its training system to make them
>>>>> able to deal with other languages and UTF-8 characters. Tesseract 3.0
>>>>> can handle any Unicode characters (coded with UTF-8), but there are
>>>>> limits as to the range of languages that it will be successful with..”
>>>>> and “..Tesseract needs to know about different shapes of the same
>>>>> character by having different fonts separated explicitly. ..” and
>>>>> “..Any language that has different punctuation and numbers is going to
>>>>> be disadvantaged by some of the hard-coded algorithms that assume ASCII
>>>>> punctuation and digits...” [2]
>>>>>
>>>>> [1] Ray Smith, Daria Antonova, Dar-Shyang Lee. Adapting the Tesseract
>>>>> open source OCR engine for multilingual OCR. Published by ACM, 2009.
>>>>> [2] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>>>
>>>>> Tamil has almost all of the above-mentioned issues.
>>>>>
>>>>> I am wondering where to start learning the code, where to test it,
>>>>> and so on.
>>>>>
>>>>> -Sibi
>>>>>
>>>>> On Wednesday, July 16, 2014 1:38:17 AM UTC+5:30, Nick White wrote:
>>>>>>
>>>>>> On Mon, Jul 14, 2014 at 11:36:46AM -0700, Paul wrote:
>>>>>> > On Monday, 14 July 2014 10:07:59 UTC+2, sibi kanagaraj wrote:
>>>>>> >     But I feel that Tamil training is not sufficient and that it
>>>>>> >     could be streamlined. Hence I went to see if there are sufficient
>>>>>> >     training documents for Tamil.
>>>>>> >     This search led me to this page. And
>>>>>> >     subsequently I found "Things I would NOT recommend working on" here.
>>>>>> >
>>>>>> >     I am a little bit stuck here. I wanted to do this project as part
>>>>>> >     of my Master's degree. Isn't Tamil training an independent module
>>>>>> >     that could be worked on?
>>>>>> >
>>>>>> > I'm not sure what's the case for Tamil, but in general the imagery
>>>>>> > for doing training is not available. So basically you would have to
>>>>>> > start all over.
>>>>>>
>>>>>> Yes, that is the case, I'm afraid. There is a project that was
>>>>>> hoping to create improved trainings for South Asian languages, but
>>>>>> it hasn't been updated for quite a few years. See
>>>>>> http://code.google.com/p/parichit/
>>>>>>
>>>>>> Can you give us some clue as to what you think could be improved
>>>>>> about the current Tamil recognition? Changes of configuration
>>>>>> variables, or ambiguity rules (the unicharambigs file), don't need
>>>>>> access to the training images.
>>>>>>
>>>>>> Oh, by the way, the "Things I would NOT recommend working on" is a
>>>>>> very old page (from 2010); I wouldn't take it too seriously...
>>>>>>
>>>>>> Nick
>>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/d16e9c59-0802-4da0-add7-fb310da00479%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/d16e9c59-0802-4da0-add7-fb310da00479%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>> For more options, visit https://groups.google.com/d/optout.
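
A note on the two-part vowel signs mentioned in the thread (பொ 0BCA, போ 0BCB, பௌ 0BCC): the postprocessing idea can be sketched with Unicode normalization. This is only a minimal sketch, not something taken from Tesseract, jTessBoxEditor or VietOCR. It assumes the OCR output already has the consonant first, followed by the two halves of the sign in logical order; the thread does not confirm that this is what the engine produces. The function names `normalize_ocr_text` and `split_parts` are made up for illustration.

```python
import unicodedata

# Two-part Tamil vowel signs named in the thread.
TWO_PART_SIGNS = {
    "\u0BCA": "TAMIL VOWEL SIGN O",
    "\u0BCB": "TAMIL VOWEL SIGN OO",
    "\u0BCC": "TAMIL VOWEL SIGN AU",
}


def normalize_ocr_text(text):
    """Recombine split two-part vowel signs using Unicode NFC.

    Hypothetical helper: it only helps when the halves arrive in
    logical order (consonant, left half, right half).
    """
    return unicodedata.normalize("NFC", text)


def split_parts(sign):
    """Return the code points of a sign's canonical decomposition (NFD)."""
    return [f"{ord(ch):04X}" for ch in unicodedata.normalize("NFD", sign)]


if __name__ == "__main__":
    for sign, name in TWO_PART_SIGNS.items():
        print(name, "decomposes to", split_parts(sign))
```

NFC does not help with the second problem raised in the thread. The right half of the AU sign is U+0BD7 TAMIL AU LENGTH MARK, a different code point from U+0BB3 TAMIL LETTER LLA, so a misrecognized ள is left untouched by normalization and would still need an unicharambigs rule or a separate, language-aware check.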