Hi. I have a similar goal: fine-tuning the 'ben' traineddata on the images I am working with. The images are ID cards, so people's names have to be recognized correctly. I tried fine-tuning the traineddata the (line image, ground truth) way with a very small number of images. The result was not good, which surprised me, as I expected at least the performance of the default model. My question: if I collect a substantial number of images and produce line images and ground truth from them, will that help improve the recognition?
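In case it helps anyone else preparing (line image, ground truth) pairs: the quoted thread below traces one failure to ground-truth files that picked up a "-2" suffix from duplicate image file names. Here is a minimal sketch of generating the transcriptions from file names while dropping that suffix. It assumes the tesstrain naming convention (same name as the image, with the extension replaced by .gt.txt) and that the text itself never ends in a hyphen followed by digits. The function names are hypothetical, not taken from anyone's actual script.

```python
import re
from pathlib import Path
from typing import List

# Assumption: duplicate images are named like "851-2.png" or "851-3.png",
# and the real text (digits and commas) never ends in "-<digits>".
_DUPLICATE_SUFFIX = re.compile(r"-\d+$")


def ground_truth_from_stem(stem: str) -> str:
    """Return the text encoded in an image file stem, without a duplicate marker."""
    return _DUPLICATE_SUFFIX.sub("", stem)


def write_ground_truth_files(image_dir: Path) -> List[Path]:
    """Write <stem>.gt.txt next to every PNG in image_dir and return the new paths."""
    written = []
    for image_path in sorted(image_dir.glob("*.png")):
        gt_path = image_path.with_name(image_path.stem + ".gt.txt")
        text = ground_truth_from_stem(image_path.stem)
        gt_path.write_text(text + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

This only works when the file name is the ground truth, as in the number images below. For names on ID cards the transcriptions would have to be typed in by hand, but the same check applies: the text in each .gt.txt must match exactly what is visible in its line image.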

On Sunday, September 27, 2020 at 9:21:17 PM UTC+6 Grad wrote:

> @shree thank you for the advice, it was helpful. I managed to get everything working satisfactorily: after adding additional training images, I now get perfect results (446 pass, 0 fail)! Furthermore, these results come with using the built-in "eng" model. I ended up not needing to re-train or fine-tune Tesseract. The ticket was finding the magic sequence of image processing steps to perform on my source images to prepare them for input to Tesseract OCR.
>
> I have battled with this problem since your response and have come close to giving up more than once, thinking that perhaps Tesseract simply isn't up to the task. But the limited character set and the uniformity of the character appearances kept me going -- there just had to be a way to make this work. I'd love to document all the things I tried, and what results they gave, but there is just too much. A quick summary will have to suffice.
>
> *What got me close but ultimately didn't work*
>
> - Resized my images so the text was 36px in height. I did this in Python using OpenCV and (wrongly I think) chose the cv2.INTER_AREA interpolation method.
> - Tried different values for MAX_ITERATIONS in tesstrain's Makefile, and got varied results but nothing perfect.
> - Downloaded https://github.com/Shreeshrii/tessdata_shreetest/blob/master/digits_comma.traineddata and used it for the START_MODEL of tesstrain's Makefile (also had to set TESSDATA for the Makefile).
> - Between these things, the best result I ever got was something like this (input on left, OCR output on right):
>
>     21,485,000 -> 21,483,000
>     21,875,000 -> 21,873,000
>     24,995 -> 24,999
>     5,450,000 -> 9,450,000
>     591,958 -> 9591,958
>     851 -> 8571
>     851 -> 8571
>     Pass: 428
>     Fail: 7
>
> - So you can see, close, but still some pretty unforgivable errors (unforgivable to me due to the nature of my application -- these numbers need to be perfect).
>
> *What ultimately did work*
>
> - In an act of desperation, and following a bit of a hunch, I abandoned trying to train/re-train/fine-tune, and just focused on getting perfect OCR on one of the images where it failed using the "eng" model.
> - I chose this file, 1,000,000.png, which produced an empty string when run through Tesseract.
> - I used GIMP on Windows and opened 1,000,000.png and began adjusting/tweaking/filtering the image in various ways, each time re-trying the OCR to see if the result changed. Using GIMP was crucial because it allowed me to iterate through trying different image processing techniques using a GUI, which was much quicker than doing the same thing in Python using OpenCV.
> - Once I found what worked, I implemented it in Python. The magic steps ended up being:
>   1. Read the source image as color:
>      image_to_ocr = cv2.imread(raw_image_file_name, cv2.IMREAD_COLOR)
>   2. Use only the green channel of the source image. The numbers in my source images are mostly green tinted and I thought maybe this would help. This results in a grayscale image with a dark background and white text:
>      b, image_to_ocr, r = cv2.split(image_to_ocr)
>   3. Enlarge the image by 2x. This resulted in text that is ~20px in height, and I found this to be necessary and sufficient. I also found the use of cv2.INTER_CUBIC instead of cv2.INTER_AREA to be crucial here. I think the resizing (enlarging in my case) of the images was an absolute must-have. I'm really thankful I posted here and really thankful to @shree for that little nugget of insight.
>      image_to_ocr = cv2.resize(image_to_ocr, (image_to_ocr.shape[1] * 2, image_to_ocr.shape[0] * 2), interpolation = cv2.INTER_CUBIC)
>   4. Invert the image so that the background is white and the text is black. I am not sure if this step was necessary.
>      image_to_ocr = cv2.bitwise_not(image_to_ocr)
> - With these steps, 1,000,000.png OCR'd perfectly.
> - I then re-ran my script to check accuracy on all 400+ source images, and got the perfect result. I was so nervous while the script was running; it prints out errors as it goes, and so many times before I'd run the script with eager anticipation that I'd finally gotten everything right, only to have an error appear. This time...it ran...seconds go by...more seconds go by...no errors...I can't look OMG...check back in 30 seconds, 446 pass, 0 fail, I literally stood up and whooped and hollered with arms raised.
>
> On Sunday, September 20, 2020 at 11:09:02 AM UTC-5 shree wrote:
>
>> Resize your images so that text is 36 pixels high. That's what is used for eng models.
>>
>> Since you are fine tuning, limit number of iterations to 400 or so (not 10000 which is default).
>>
>> Use debug_interval of -1 during training so that you can see the details per iteration.
>>
>> On Sun, Sep 20, 2020, 00:24 Grad <kes...@gmail.com> wrote:
>>
>>> I have fixed my ground-truth file creator script to eliminate the badly-formed numbers and have re-run my experiment.
>>> Unfortunately, I am still seeing really poor results (12 pass, 383 fail), even though the training error rates appear to be much smaller this time around:
>>>
>>> At iteration 509/10000/10000, Mean rms=0.184%, delta=0.055%, char train=0.344%, word train=2.5%, skip ratio=0%, New worst char error = 0.344 wrote checkpoint.
>>>
>>> Finished! Error rate = 0.308
>>> lstmtraining \
>>>   --stop_training \
>>>   --continue_from data/swtor/checkpoints/swtor_checkpoint \
>>>   --traineddata data/swtor/swtor.traineddata \
>>>   --model_output data/swtor.traineddata
>>> Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
>>>
>>> Full log of "make training" is attached.
>>>
>>> When I run Tesseract using the "eng" and "swtor" models on the training images, I'm seeing the following types of results:
>>>
>>> "eng" model results for 638,997.png:
>>>
>>> > tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>> > cat .\out.txt
>>> 638,997
>>>
>>> "swtor" model results for 638,997.png:
>>>
>>> > tesseract --tessdata-dir -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>>> Failed to load any lstm-specific dictionaries for lang swtor!!
>>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>> > cat .\out.txt
>>> 3,9,997
>>>
>>> In general, digits are more erroneous, and there is a proliferation of commas.
>>>
>>> Do any other ideas come to mind? I appreciate your help Shree!
>>>
>>> On Saturday, September 19, 2020 at 12:12:19 PM UTC-5 Grad wrote:
>>>
>>>> If it turns out to be that simple, I will feel really relieved and really stupid at the same time. I cannot believe I didn't catch this before posting.
>>>> Thank you for taking a look, I'll fix my ground-truth file creator script and try again.
>>>>
>>>> On Saturday, September 19, 2020 at 12:01:50 PM UTC-5 shree wrote:
>>>>
>>>>> You will get better results when you fix your training data (I deleted all file names ending in -2 and -3).
>>>>>
>>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.01%), skip ratio=0%
>>>>> Iteration 396: GROUND TRUTH : 5,500,000
>>>>> File data/swtor-ground-truth/5,500,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.008%), skip ratio=0%
>>>>> Iteration 397: GROUND TRUTH : 2,000,000
>>>>> File data/swtor-ground-truth/2,000,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.005%), skip ratio=0%
>>>>> Iteration 398: GROUND TRUTH : 6,435
>>>>> File data/swtor-ground-truth/6,435.lstmf line 0 (Perfect):
>>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.003%), skip ratio=0%
>>>>> Iteration 399: GROUND TRUTH : 3,750,000
>>>>> File data/swtor-ground-truth/3,750,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(1%), skip ratio=0%
>>>>> 2 Percent improvement time=4, best error was 100 @ 0
>>>>> At iteration 4/400/400, Mean rms=0.144%, delta=0.045%, char train=0.212%, word train=1%, skip ratio=0%, New best char error = 0.212 wrote best model:data/swtor/checkpoints/swtor_0.212_4_400.checkpoint wrote checkpoint.
>>>>>
>>>>> Iteration 400: GROUND TRUTH : 5,222,100
>>>>> File data/swtor-ground-truth/5,222,100.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(0.998%), skip ratio=0%
>>>>> Iteration 401: GROUND TRUTH : 696,969
>>>>> File data/swtor-ground-truth/696,969.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.995%), skip ratio=0%
>>>>> Iteration 402: GROUND TRUTH : 71,000,000
>>>>> File data/swtor-ground-truth/71,000,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.993%), skip ratio=0%
>>>>> Iteration 403: GROUND TRUTH : 64,500
>>>>> File data/swtor-ground-truth/64,500.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.99%), skip ratio=0%
>>>>> Iteration 404: GROUND TRUTH : 39,500,000
>>>>> File data/swtor-ground-truth/39,500,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.988%), skip ratio=0%
>>>>> Iteration 405: GROUND TRUTH : 4,500,000
>>>>> File data/swtor-ground-truth/4,500,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.143%, delta=0.045%, train=0.209%(0.985%), skip ratio=0%
>>>>> Iteration 406: GROUND TRUTH : 1,450,000
>>>>>
>>>>> On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar <shree...@gmail.com> wrote:
>>>>>
>>>>>> > Each of my PNG files have file names that indicate ground truth, and I have a little script that generates ground-truth TXT files from the PNG file names.
>>>>>>
>>>>>> Please review your script.
>>>>>> I notice a number of file names ending with -2. The gt.txt files for the same also contain -2 while the image only has the number.
>>>>>>
>>>>>> Example files attached.
>>>>>
>>>>> --
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f20fef2a-367c-4b10-b1b5-f8349679b4edn%40googlegroups.com.