@shree thank you for the advice, it was helpful. I managed to get everything working satisfactorily: after adding additional training images, I now get perfect results (446 pass, 0 fail)! Furthermore, these results come from the built-in "eng" model. I ended up not needing to re-train or fine-tune Tesseract. The ticket was finding the magic sequence of image processing steps to perform on my source images to prepare them for input to Tesseract OCR.

I have battled with this problem since your response and have come close to giving up more than once, thinking that perhaps Tesseract simply isn't up to the task. But the limited character set and the uniformity of the character appearances kept me going -- there just had to be a way to make this work. I'd love to document all the things I tried, and what results they gave, but there is just too much. A quick summary will have to suffice.

*What got me close but ultimately didn't work*

- Resized my images so the text was 36px in height. I did this in Python using OpenCV and (wrongly, I think) chose the cv2.INTER_AREA interpolation method.
- Tried different values for MAX_ITERATIONS in tesstrain's Makefile, and got varied results but nothing perfect.
- Downloaded https://github.com/Shreeshrii/tessdata_shreetest/blob/master/digits_comma.traineddata and used it for the START_MODEL of tesstrain's Makefile (also had to set TESSDATA for the Makefile).
- Between these things, the best result I ever got was something like this (input on left, OCR output on right):

    21,485,000 -> 21,483,000
    21,875,000 -> 21,873,000
    24,995 -> 24,999
    5,450,000 -> 9,450,000
    591,958 -> 9591,958
    851 -> 8571
    851 -> 8571
    Pass: 428 Fail: 7

- So you can see: close, but still some pretty unforgivable errors (unforgivable to me due to the nature of my application -- these numbers need to be perfect).

*What ultimately did work*

- In an act of desperation, and following a bit of a hunch, I abandoned trying to train/re-train/fine-tune, and just focused on getting perfect OCR with the "eng" model on one of the images where it had failed.
- I chose the file 1,000,000.png, which produced an empty string when run through Tesseract.
- I used GIMP on Windows to open 1,000,000.png and began adjusting/tweaking/filtering the image in various ways, each time re-trying the OCR to see if the result changed.
  Using GIMP was crucial because it let me iterate through different image processing techniques in a GUI, which was much quicker than doing the same thing in Python using OpenCV.
- Once I found what worked, I implemented it in Python. The magic steps ended up being:

  1. Read the source image as color:

       image_to_ocr = cv2.imread(raw_image_file_name, cv2.IMREAD_COLOR)

  2. Use only the green channel of the source image (OpenCV loads channels in B, G, R order, so green is the middle one). The numbers in my source images are mostly green tinted and I thought maybe this would help. This results in a grayscale image with a dark background and white text:

       b, image_to_ocr, r = cv2.split(image_to_ocr)

  3. Enlarge the image by 2x. This resulted in text that is ~20px in height, and I found this to be necessary and sufficient. I also found the use of cv2.INTER_CUBIC instead of cv2.INTER_AREA to be crucial here. I think the resizing (enlarging in my case) of the images was an absolute must-have. I'm really thankful I posted here and really thankful to @shree for that little nugget of insight.

       image_to_ocr = cv2.resize(image_to_ocr, (image_to_ocr.shape[1] * 2, image_to_ocr.shape[0] * 2), interpolation = cv2.INTER_CUBIC)

  4. Invert the image so that the background is white and the text is black. I am not sure if this step was necessary.

       image_to_ocr = cv2.bitwise_not(image_to_ocr)

- With these steps, 1,000,000.png OCR'd perfectly.
- I then re-ran my script to check accuracy on all 400+ source images, and got the perfect result. I was so nervous while the script was running; it prints out errors as it goes, and so many times before I'd run the script with eager anticipation that I'd finally gotten everything right, only to have an error appear. This time...it ran...seconds go by...more seconds go by...no errors...I can't look OMG...check back in 30 seconds, 446 pass, 0 fail. I literally stood up and whooped and hollered with arms raised.

On Sunday, September 20, 2020 at 11:09:02 AM UTC-5 shree wrote:

> Resize your images so that text is 36 pixels high. That's what is used for
> eng models.
>
> Since you are fine tuning, limit number of iterations to 400 or so (not
> 10000 which is default).
>
> Use debug_level of -1 during training so that you can see the details per
> iteration.
>
> On Sun, Sep 20, 2020, 00:24 Grad <kes...@gmail.com> wrote:
>
>> I have fixed my ground-truth file creator script to eliminate the
>> badly-formed numbers and have re-run my experiment. Unfortunately, I am
>> still seeing really poor results (12 pass, 383 fail), even though the
>> training error rates appear to be much smaller this time around:
>>
>> At iteration 509/10000/10000, Mean rms=0.184%, delta=0.055%, char train=0.344%, word train=2.5%, skip ratio=0%, New worst char error = 0.344 wrote checkpoint.
>>
>> Finished! Error rate = 0.308
>> lstmtraining \
>> --stop_training \
>> --continue_from data/swtor/checkpoints/swtor_checkpoint \
>> --traineddata data/swtor/swtor.traineddata \
>> --model_output data/swtor.traineddata
>> Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
>>
>> Full log of "make training" is attached.
>>
>> When I run Tesseract using the "eng" and "swtor" models on the training
>> images, I'm seeing the following types of results:
>>
>> "eng" model results for 638,997.png:
>>
>> > tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>> > cat .\out.txt
>> 638,997
>>
>> "swtor" model results for 638,997.png:
>>
>> > tesseract --tessdata-dir -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>> Failed to load any lstm-specific dictionaries for lang swtor!!
>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>> > cat .\out.txt
>> 3,9,997
>>
>> In general, digits are more erroneous, and there is a proliferation of
>> commas.
>>
>> Do any other ideas come to mind? I appreciate your help Shree!
>>
>> On Saturday, September 19, 2020 at 12:12:19 PM UTC-5 Grad wrote:
>>
>>> If it turns out to be that simple, I will feel really relieved and
>>> really stupid at the same time. I cannot believe I didn't catch this before
>>> posting. Thank you for taking a look, I'll fix my ground-truth file creator
>>> script and try again.
>>>
>>> On Saturday, September 19, 2020 at 12:01:50 PM UTC-5 shree wrote:
>>>
>>>> You will get better results when you fix your training data (I deleted
>>>> all file names ending in -2 and -3).
>>>>
>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.01%), skip ratio=0%
>>>> Iteration 396: GROUND TRUTH : 5,500,000
>>>> File data/swtor-ground-truth/5,500,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.008%), skip ratio=0%
>>>> Iteration 397: GROUND TRUTH : 2,000,000
>>>> File data/swtor-ground-truth/2,000,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.005%), skip ratio=0%
>>>> Iteration 398: GROUND TRUTH : 6,435
>>>> File data/swtor-ground-truth/6,435.lstmf line 0 (Perfect):
>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.003%), skip ratio=0%
>>>> Iteration 399: GROUND TRUTH : 3,750,000
>>>> File data/swtor-ground-truth/3,750,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(1%), skip ratio=0%
>>>> 2 Percent improvement time=4, best error was 100 @ 0
>>>> At iteration 4/400/400, Mean rms=0.144%, delta=0.045%, char train=0.212%, word train=1%, skip ratio=0%, New best char error = 0.212 wrote best model:data/swtor/checkpoints/swtor_0.212_4_400.checkpoint wrote checkpoint.
>>>>
>>>> Iteration 400: GROUND TRUTH : 5,222,100
>>>> File data/swtor-ground-truth/5,222,100.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(0.998%), skip ratio=0%
>>>> Iteration 401: GROUND TRUTH : 696,969
>>>> File data/swtor-ground-truth/696,969.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.995%), skip ratio=0%
>>>> Iteration 402: GROUND TRUTH : 71,000,000
>>>> File data/swtor-ground-truth/71,000,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.993%), skip ratio=0%
>>>> Iteration 403: GROUND TRUTH : 64,500
>>>> File data/swtor-ground-truth/64,500.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.99%), skip ratio=0%
>>>> Iteration 404: GROUND TRUTH : 39,500,000
>>>> File data/swtor-ground-truth/39,500,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.988%), skip ratio=0%
>>>> Iteration 405: GROUND TRUTH : 4,500,000
>>>> File data/swtor-ground-truth/4,500,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.143%, delta=0.045%, train=0.209%(0.985%), skip ratio=0%
>>>> Iteration 406: GROUND TRUTH : 1,450,000
>>>>
>>>> On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar <shree...@gmail.com>
>>>> wrote:
>>>>
>>>>> > Each of my PNG files have file names that indicate ground truth,
>>>>> and I have a little script that generates ground-truth TXT files from the
>>>>> PNG file names.
>>>>>
>>>>> Please review your script. I notice a number of file names ending with
>>>>> -2. The gt.txt files for the same also contain -2 while the image only
>>>>> has the number.
>>>>>
>>>>> Example files attached.
>>>>
>>>> --
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/70e5fed6-3035-4885-965c-0552560ef0f6n%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d1e0a335-2de8-4892-872f-e3459f695a19n%40googlegroups.com.