Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

Grad Sun, 27 Sep 2020 08:22:08 -0700

@shree thank you for the advice, it was helpful. I managed to get 
everything working satisfactorily: after adding additional training images, 
I now get perfect results (446 pass, 0 fail)! Furthermore, these results 
come with using the built-in "eng" model. I ended up not needing to 
re-train or fine-tune Tesseract. The ticket was finding the magic sequence 
of image processing steps to perform on my source images to prepare them 
for input to Tesseract OCR.


I have battled with this problem since your response and have come close to 
giving up more than once, thinking that perhaps Tesseract simply isn't up 
to the task. But the limited character set and the uniformity of the 
character appearances kept me going -- there just had to be a way to make 
this work. I'd love to document all the things I tried, and what results 
they gave, but there is just too much. A quick summary will have to suffice.

*What got me close but ultimately didn't work*

   - Resized my images so the text was 36px in height. I did this in Python 
   using OpenCV and (wrongly I think) chose the cv2.INTER_AREA interpolation 
   method.
   - Tried different values for MAX_ITERATIONS in tesstrain's Makefile, and 
   got varied results but nothing perfect.
   - Downloaded 
   https://github.com/Shreeshrii/tessdata_shreetest/blob/master/digits_comma.traineddata 
   and used it for the START_MODEL of tesstrain's Makefile (also had to set 
   TESSDATA for the Makefile)
   - Between these things, the best result I ever got was something like 
   this (expected value on left, OCR output on right):
   21,485,000 -> 21,483,000
   21,875,000 -> 21,873,000
   24,995 -> 24,999
   5,450,000 -> 9,450,000
   591,958 -> 9591,958
   851 -> 8571
   851 -> 8571
   Pass: 428
   Fail: 7
   - So you can see, close, but still some pretty unforgivable errors 
   (unforgivable to me due to the nature of my application -- these numbers 
   need to be perfect)
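The pass/fail counts above come from a checking script that is not included in this message. A minimal sketch of that kind of check might look like the following. It assumes the ground truth is encoded in each file name (as described in the quoted messages) and that duplicate images carry a -2 or -3 suffix. The names ground_truth_from_name and score, and the ocr callable, are hypothetical and only for illustration.

```python
import re
from pathlib import Path
from typing import Callable, Iterable, Tuple

# Assumed convention: duplicates of the same number are saved as
# "851-2.png", "851-3.png"; the suffix is not part of the ground truth.
_DUPLICATE_SUFFIX = re.compile(r"-\d+$")


def ground_truth_from_name(image_path: str) -> str:
    """Derive the expected text from a file name such as '21,485,000.png'."""
    stem = Path(image_path).stem
    return _DUPLICATE_SUFFIX.sub("", stem)


def score(image_paths: Iterable[str], ocr: Callable[[str], str]) -> Tuple[int, int]:
    """Run `ocr` on every image, print mismatches, return (passed, failed)."""
    passed = failed = 0
    for path in image_paths:
        expected = ground_truth_from_name(path)
        actual = ocr(path).strip()  # drop the trailing newline OCR tools emit
        if actual == expected:
            passed += 1
        else:
            failed += 1
            print(f"{expected} -> {actual}")
    print(f"Pass: {passed}")
    print(f"Fail: {failed}")
    return passed, failed
```

Here ocr is any function that takes an image path and returns the recognized text, for example a thin wrapper around the tesseract command line shown in the quoted messages.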

*What ultimately did work*

   - In an act of desperation, and following a bit of a hunch, I abandoned 
   trying to train/re-train/fine-tune, and just focused on getting perfect OCR 
   on one of the images where it failed using "eng" model
      - I chose this file, 1,000,000.png, which produced an empty string 
      when run through Tesseract
      - I used GIMP on Windows and opened 1,000,000.png and began 
      adjusting/tweaking/filtering the image in various ways, each time re-trying 
      the OCR to see if the result changed. Using GIMP was crucial because it 
      allowed me to iterate through trying different image processing techniques 
      using a GUI, which was much quicker than doing the same thing in Python 
      using OpenCV.
   - Once I found what worked, I implemented it in Python. The magic steps 
   ended up being:
      1. Read the source image as color:
      image_to_ocr = cv2.imread(raw_image_file_name, cv2.IMREAD_COLOR)
      2. Use only the green channel of the source image. The numbers in my 
      source images are mostly green tinted and I thought maybe this would 
      help. This results in a grayscale image with a dark background and 
      white text:
      b, image_to_ocr, r = cv2.split(image_to_ocr)
      3. Enlarge the image by 2x. This resulted in text that is ~20px in 
      height, and I found this to be both necessary and sufficient. I also 
      found the use of cv2.INTER_CUBIC instead of cv2.INTER_AREA to be crucial 
      here. I think the resizing (enlarging in my case) of the images was an 
      absolute must-have. I'm really thankful I posted here and really thankful 
      to @shree for that little nugget of insight.
      image_to_ocr = cv2.resize(image_to_ocr, (image_to_ocr.shape[1] * 2, 
      image_to_ocr.shape[0] * 2), interpolation = cv2.INTER_CUBIC)
      4. Invert the image so that the background is white and the text is 
      black. I am not sure if this step was necessary.
      image_to_ocr = cv2.bitwise_not(image_to_ocr)
      - With these steps, 1,000,000.png OCR'd perfectly
   - I then re-ran my script to check accuracy on all 400+ source images, 
   and got the perfect result. I was so nervous while the script was running; 
   it prints out errors as it goes, and so many times before I'd run the 
   script with eager anticipation that I'd finally gotten everything right, 
   only to have an error appear. This time...it ran...seconds go by...more 
   seconds go by...no errors...I can't look OMG...check back in 30 seconds, 
   446 pass, 0 fail, I literally stood up and whooped and hollered with arms 
   raised.
   

On Sunday, September 20, 2020 at 11:09:02 AM UTC-5 shree wrote:

> Resize your images so that text is 36 pixels high. That's what is used for 
> eng models.
>
> Since you are fine tuning, limit number of iterations to 400 or so (not 
> 10000 which is default).
>
> Use debug_interval of -1 during training so that you can see the details per 
> iteration.
>
>
>
> On Sun, Sep 20, 2020, 00:24 Grad <kes...@gmail.com> wrote:
>
>> I have fixed my ground-truth file creator script to eliminate the 
>> badly-formed numbers and have re-run my experiment. Unfortunately, I am 
>> still seeing really poor results (12 pass, 383 fail), even though the 
>> training error rates appear to be much smaller this time around:
>>
>> At iteration 509/10000/10000, Mean rms=0.184%, delta=0.055%, char 
>> train=0.344%, word train=2.5%, skip ratio=0%,  New worst char error = 0.344 
>> wrote checkpoint.
>>
>> Finished! Error rate = 0.308
>> lstmtraining \
>> --stop_training \
>> --continue_from data/swtor/checkpoints/swtor_checkpoint \
>> --traineddata data/swtor/swtor.traineddata \
>> --model_output data/swtor.traineddata
>> Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
>>
>> Full log of "make training" is attached.
>>
>> When I run Tesseract using the "eng" and "swtor" models on the training 
>> images, I'm seeing the following types of results:
>>
>> "eng" model results for 638,997.png:
>>
>> > tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>> > cat .\out.txt
>> 638,997
>>
>> "swtor" model results for 638,997.png:
>>
>> > tesseract --tessdata-dir -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>> Failed to load any lstm-specific dictionaries for lang swtor!!
>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>> > cat .\out.txt
>> 3,9,997
>>
>> In general, digits are more erroneous, and there is a proliferation of 
>> commas.
>>
>> Do any other ideas come to mind? I appreciate your help Shree!
>>
>> On Saturday, September 19, 2020 at 12:12:19 PM UTC-5 Grad wrote:
>>
>>> If it turns out to be that simple, I will feel really relieved and 
>>> really stupid at the same time. I cannot believe I didn't catch this before 
>>> posting. Thank you for taking a look, I'll fix my ground-truth file creator 
>>> script and try again.
>>>
>>> On Saturday, September 19, 2020 at 12:01:50 PM UTC-5 shree wrote:
>>>
>>>> You will get better results when you fix your training data (I deleted 
>>>> all file names ending in -2 and -3).
>>>>
>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.01%), skip ratio=0%
>>>> Iteration 396: GROUND  TRUTH : 5,500,000
>>>> File data/swtor-ground-truth/5,500,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.008%), skip ratio=0%
>>>> Iteration 397: GROUND  TRUTH : 2,000,000
>>>> File data/swtor-ground-truth/2,000,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.005%), skip ratio=0%
>>>> Iteration 398: GROUND  TRUTH : 6,435
>>>> File data/swtor-ground-truth/6,435.lstmf line 0 (Perfect):
>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.003%), skip ratio=0%
>>>> Iteration 399: GROUND  TRUTH : 3,750,000
>>>> File data/swtor-ground-truth/3,750,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(1%), skip ratio=0%
>>>> 2 Percent improvement time=4, best error was 100 @ 0
>>>> At iteration 4/400/400, Mean rms=0.144%, delta=0.045%, char 
>>>> train=0.212%, word train=1%, skip ratio=0%,  New best char error = 0.212 
>>>> wrote best model:data/swtor/checkpoints/swtor_0.212_4_400.checkpoint wrote 
>>>> checkpoint.
>>>>
>>>> Iteration 400: GROUND  TRUTH : 5,222,100
>>>> File data/swtor-ground-truth/5,222,100.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(0.998%), skip ratio=0%
>>>> Iteration 401: GROUND  TRUTH : 696,969
>>>> File data/swtor-ground-truth/696,969.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.995%), skip ratio=0%
>>>> Iteration 402: GROUND  TRUTH : 71,000,000
>>>> File data/swtor-ground-truth/71,000,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.993%), skip ratio=0%
>>>> Iteration 403: GROUND  TRUTH : 64,500
>>>> File data/swtor-ground-truth/64,500.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.99%), skip ratio=0%
>>>> Iteration 404: GROUND  TRUTH : 39,500,000
>>>> File data/swtor-ground-truth/39,500,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.988%), skip ratio=0%
>>>> Iteration 405: GROUND  TRUTH : 4,500,000
>>>> File data/swtor-ground-truth/4,500,000.lstmf line 0 (Perfect):
>>>> Mean rms=0.143%, delta=0.045%, train=0.209%(0.985%), skip ratio=0%
>>>> Iteration 406: GROUND  TRUTH : 1,450,000
>>>>
>>>>
>>>>
>>>> On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar <shree...@gmail.com> 
>>>> wrote:
>>>>
>>>>> > Each of my PNG files have file names that indicate ground truth, 
>>>>> > and I have a little script that generates ground-truth TXT files from the 
>>>>> > PNG file names.
>>>>>
>>>>> Please review your script. I notice a number of file names ending with 
>>>>> -2. The gt.txt files for the same also contain -2 while the image only has 
>>>>> the number.
>>>>>
>>>>> Example files attached.
>>>>>
>>>>>
>>>>>
>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/d1e0a335-2de8-4892-872f-e3459f695a19n%40googlegroups.com.
