Sriranga, Sorry for the delay.
I meant that for training you just need to use as many *different* images as possible, not multiple renamed copies of the same image.

Warm regards,
Dmitri Silaev

On Mon, Apr 4, 2011 at 2:56 PM, Sriranga(78yrsold) <[email protected]> wrote:
> Dmitri,
> I am extremely thankful for the valuable guidance.
> With reference to your last paragraph: I could not follow it clearly and am confused. Kindly elaborate a little with a sample of yours (any language, or English, will do). Kindly pardon me for troubling you in the midst of your hectic work.
> With Choicest Best Wishes and Good Luck,
> -sriranga(78yrs)
>
> On Mon, Apr 4, 2011 at 11:50 AM, Dmitri Silaev <[email protected]> wrote:
>>
>> Dear Sriranga,
>>
>> Sorry for the delay.
>>
>> You can indeed manually set the DPI in an image file using any image editor, but the only thing that matters is the resolution your image got from the scanner. Roughly speaking, resolution here means the number of pixels per letter. This is controlled by the scanner itself or by the scanning program's settings. By changing the DPI afterwards in an image editor, you just change some of the image's attribute values, not the image's pixels.
>>
>> 300 DPI is more than okay for your needs.
>>
>> Renaming a box/train file and feeding it to Tesseract as another sample is not a solution, as by "sample" we mean here a copy of a character obtained under slightly different conditions in another [scanned] image, or at least at another position in the same image. So get as many images as possible, count the number of character samples within each, and thus build your training body.
>>
>> Warm regards,
>> Dmitri Silaev
>>
>> On Sat, Apr 2, 2011 at 1:13 PM, Sriranga(78yrsold) <[email protected]> wrote:
>> > Dear Dmitri,
>> > Awaiting your valuable guidance please.
>> > With warmest regards,
>> > -sriranga(78yrs)
>> >
>> > On Wed, Mar 30, 2011 at 8:29 PM, Sriranga(78yrsold) <[email protected]> wrote:
>> >>
>> >> Dear Dmitri,
>> >> I presume that 300 x 300 dpi is reasonable for the scanned images?
>> >> With the help of Irfanview I can find out the dpi, and can also increase or decrease it.
>> >> Generally, as a standard, I select dpi = 300 and resize to 1200 or 2400 from 600, which is convenient for editing the box file with the help of owler. I hope this will not reduce the accuracy of the output. Sample tif attached for approval.
>> >>
>> >> Regarding 20 samples of each char: suppose the image1.tif file contains the alphabet with a single copy of each char. Can it be used 20 times by renaming the same image file as image1.tif, image2.tif, image3.tif ..... image20.tif? If not, kindly provide me with your sample, if any.
>> >> With Warmest Regards,
>> >> -sriranga(78yrs)
>> >>
>> >> On Wed, Mar 30, 2011 at 5:12 PM, Dmitri Silaev <[email protected]> wrote:
>> >>>
>> >>> Depending on the quality of your source images, I think it'd be reasonable to scale them down so that letters have a height of 40 pixels or so. That way Tesseract will have a bit less work to do: scan fewer pixels and construct shorter glyph outlines.
>> >>>
>> >>> The accuracy may suffer even at such a considerable char height (90 is certainly more than enough) if there are significant discrepancies between training and source images. You should try to pass Tesseract images whose thickness and orientation are as similar as possible. To achieve this, you need to pre-process images so they look alike with respect to lighting conditions, contrast, blur amount, physical dimensions; rectify perspective distortion, etc.
>> >>> And of course, always use the same binarization procedure with the same parameter set, or at least one giving predictably similar results across the range of your source images. Btw, applying Otsu thresholding before passing images to Tesseract is useless, as Otsu is the binarization procedure Tesseract itself employs. The exception is if you run Otsu with your own special parameter set and then pass a 1-bit image.
>> >>>
>> >>> Next, you should train Tesseract keeping in mind that ideally there should be around 20 samples of each char. You shouldn't strive to train on as many char sizes as possible: regardless of size, Tesseract scales character "models" up or down to the same internal dimensions. But if your source char sizes differ, that's no problem, they'll do. Provide real (probably pre-processed) images for training, not manually compiled ones.
>> >>>
>> >>> To further improve speed and accuracy, process your images char by char, bypassing Tesseract's layout analysis. This approach also makes it easy to use char-position-specific whitelists (letters, digits) for even more speed and precision.
>> >>>
>> >>> Everything related to Tesseract's dictionary facility is totally irrelevant here. You'd do better to provide entirely empty files for your "traineddata".
>> >>>
>> >>> HTH
>> >>>
>> >>> Warm regards,
>> >>> Dmitri Silaev
>> >>>
>> >>> On Tue, Mar 29, 2011 at 8:17 AM, Andres <[email protected]> wrote:
>> >>> > ...required.
>> >>> >
>> >>> > Hello people,
>> >>> >
>> >>> > I've been developing a licence plate recognition system for a long time, and I still have to improve my use of Tesseract to make it usable.
>> >>> >
>> >>> > My first concern is about speed:
>> >>> > After extracting the licence plate image, I get an image like this:
>> >>> >
>> >>> > https://docs.google.com/leaf?id=0BxkuvS_LuBAzNmRkODhkYTUtNjcyYS00Nzg5LWE0ZDItNWM4YjRkYzhjYTFh&hl=en&authkey=CP-6tsgP
>> >>> >
>> >>> > As you may see, there are only 6 characters (tess recognizes more because there are some blemishes there, but I get rid of them with some postprocessing of the layout of the recognized chars).
>> >>> >
>> >>> > On an Intel i7 720 (good power, but using a single thread) the Tesseract part takes something like 230 ms. This is too much time for what I need.
>> >>> >
>> >>> > The image is 500 x 117 pixels. I noticed that when I reduce the size of this image, the detection time drops in proportion to the image area, which makes good sense. But the accuracy of the OCR is poor when the character height is below 90 pixels.
>> >>> >
>> >>> > So I assume there is a problem with the way I trained Tesseract.
>> >>> >
>> >>> > Because the characters in the plates are assorted (3 alphanumeric, 3 numeric), I trained it with just a single image containing all the letters of the alphabet. I saw that you suggest large training sets, but I imagine that doesn't apply here, where the characters are not organized in words. Am I correct about this?
>> >>> >
>> >>> > So, for you to see, this is the image I trained Tesseract with:
>> >>> >
>> >>> > https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0BxkuvS_LuBAzODc1YjIxNWUtNzIxMS00Yjg3LTljMDctNDkyZGIxZWM4YWVm&hl=en&authkey=CMXwo-AL
>> >>> >
>> >>> > In this image the characters are about 55 pixels high.
>> >>> >
>> >>> > Then, for frequent_word_list and words_list I included a single entry for each character; I mean, something starting like this:
>> >>> >
>> >>> > A
>> >>> > B
>> >>> > C
>> >>> > D
>> >>> > ...
>> >>> >
>> >>> > Do you see anything to improve in what I did? Should I perhaps use a training image with more letters, with more combinations? Will that help somehow?
>> >>> >
>> >>> > Should I include in the same image a copy of the same character set at a smaller size? That way, will I be able to pass Tesseract smaller images and gain speed without sacrificing detection quality?
>> >>> >
>> >>> > On the other hand, I found some strange behavior in Tesseract that I would like to know a little more about:
>> >>> > In my preprocessing I tried Otsu thresholding (http://en.wikipedia.org/wiki/Otsu%27s_method) and visually I got much better results, but surprisingly it was worse for Tesseract. It decreased the stroke thickness of the chars, and the chars I used to train Tesseract were bolder. So, does Tesseract match the "boldness" of the characters? Should I train Tesseract with different levels of boldness?
>> >>> >
>> >>> > I'm using Tesseract 2.04 for this. Do you think some of these issues will improve with Tess 3.0?
>> >>> >
>> >>> > Thanks,
>> >>> >
>> >>> > Andres
>> >>> >
>> >>> > --
>> >>> > You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> >>> > To post to this group, send email to [email protected].
>> >>> > To unsubscribe from this group, send email to [email protected].
>> >>> > For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
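
As a rough illustration of the advice in this thread to "count the number of character samples within each" image, here is a minimal Python sketch that tallies glyphs across box files and lists the characters that fall short of the roughly 20 samples suggested. It assumes each box line starts with the glyph followed by four coordinates (an optional page number may follow). The function names `count_box_samples` and `undersampled` are made up for this example and are not part of Tesseract.

```python
from collections import Counter


def count_box_samples(box_texts):
    """Tally how many times each glyph appears across several box files.

    box_texts is an iterable of strings, each holding the full text of one
    box file (assumed layout: "<glyph> <left> <bottom> <right> <top> [page]").
    """
    counts = Counter()
    for text in box_texts:
        for line in text.splitlines():
            fields = line.split()
            # Skip blank or malformed lines that lack the four coordinates.
            if len(fields) >= 5:
                counts[fields[0]] += 1
    return counts


def undersampled(counts, alphabet, target=20):
    """Return {glyph: count} for every glyph in alphabet below the target."""
    return {ch: counts.get(ch, 0) for ch in alphabet if counts.get(ch, 0) < target}


if __name__ == "__main__":
    # Two tiny, invented box files standing in for two different scans.
    scan_a = "A 10 10 40 60\nB 50 10 80 60\n"
    scan_b = "A 12 11 42 61 0\n"
    totals = count_box_samples([scan_a, scan_b])
    print(undersampled(totals, "ABC", target=2))
```

Feeding it the box files from genuinely different scans shows which characters still need more images. Renamed copies would inflate the counts without adding any new information.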

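Since renamed copies of one image add no new samples, a quick content-hash check can flag them before training. This is a minimal sketch, assuming the files are small enough to hold in memory. `find_renamed_copies` is a hypothetical helper name, and byte-identical content is the only kind of duplicate it detects (a re-saved or re-scanned image will not match).

```python
import hashlib
from collections import defaultdict


def find_renamed_copies(files):
    """Group file names whose contents are byte-for-byte identical.

    files maps a file name to its raw bytes. Returns a list of groups,
    each a sorted list of names that share the same SHA-256 digest.
    """
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


if __name__ == "__main__":
    # Invented byte strings standing in for image file contents.
    training_set = {
        "image1.tif": b"same pixels",
        "image2.tif": b"same pixels",
        "image3.tif": b"other pixels",
    }
    print(find_renamed_copies(training_set))
```

In practice the byte strings would come from reading each training image in binary mode.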

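The suggestion in this thread to scale images down until letters are about 40 pixels high comes down to one ratio. Below is a small worked sketch using the numbers from the original question (a 500 x 117 plate image with characters about 90 pixels high). `scaled_size` is a hypothetical helper, and the actual resampling would be done by whatever image library is in use.

```python
def scaled_size(width, height, char_height, target_char_height=40):
    """Return (new_width, new_height, factor) so chars end up target-sized.

    The factor is target_char_height / char_height and is applied to both
    dimensions, so the aspect ratio is preserved.
    """
    factor = target_char_height / char_height
    new_width = max(1, round(width * factor))
    new_height = max(1, round(height * factor))
    return new_width, new_height, factor


if __name__ == "__main__":
    w, h, f = scaled_size(500, 117, char_height=90)
    print(w, h, round(f, 3))
```

For the 500 x 117 image this gives 222 x 52 pixels, about a fifth of the original area.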