Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

Fazle Rabbi Fri, 09 Oct 2020 21:01:31 -0700

Hi. I have a similar goal: fine-tuning the 'ben' traineddata on the images I 
am working with. The images are of IDs, so people's names have to be 
recognized correctly. I tried fine-tuning with (line image, ground truth) 
pairs, using a very small number of images. The result was not good. I was 
surprised, as I expected at least the performance of the default model. My 
question: if I collect a substantial number of images and produce line images 
and ground truth from them, will that improve recognition?


On Sunday, September 27, 2020 at 9:21:17 PM UTC+6 Grad wrote:

> @shree thank you for the advice, it was helpful. I managed to get 
> everything working satisfactorily: after adding additional training images, 
> I now get perfect results (446 pass, 0 fail)! Furthermore, these results 
> come with using the built-in "eng" model. I ended up not needing to 
> re-train or fine-tune Tesseract. The ticket was finding the magic sequence 
> of image processing steps to perform on my source images to prepare them 
> for input to Tesseract OCR.
>
> I have battled with this problem since your response and have come close 
> to giving up more than once, thinking that perhaps Tesseract simply isn't 
> up to the task. But the limited character set and the uniformity of the 
> character appearances kept me going -- there just had to be a way to make 
> this work. I'd love to document all the things I tried, and what results 
> they gave, but there is just too much. A quick summary will have to suffice.
>
> *What got me close but ultimately didn't work*
>
>    - Resized my images so the text was 36px in height. I did this in 
>    Python using OpenCV and (wrongly I think) chose the cv2.INTER_AREA 
>    interpolation method.
>    - Tried different values for MAX_ITERATIONS in tesstrain's Makefile, 
>    and got varied results but nothing perfect.
>    - Downloaded 
>    https://github.com/Shreeshrii/tessdata_shreetest/blob/master/digits_comma.traineddata
>    and used it as the START_MODEL in tesstrain's Makefile (also had to set 
>    TESSDATA for the Makefile)
>    - Between these things, the best result I ever got was something like 
>    this (input on left, OCR output on right):
>    21,485,000 -> 21,483,000
>    21,875,000 -> 21,873,000
>    24,995 -> 24,999
>    5,450,000 -> 9,450,000
>    591,958 -> 9591,958
>    851 -> 8571
>    851 -> 8571
>    Pass: 428
>    Fail: 7
>    - So you can see, close, but still some pretty unforgivable errors 
>    (unforgivable to me due to the nature of my application -- these numbers 
>    need to be perfect)
>
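The pass/fail tally above can be reproduced with a small comparison loop. This is a minimal sketch, assuming the expected value and the OCR output are already available as string pairs. The `score` helper is illustrative and is not the original script; the sample pairs are copied from this thread.

```python
from typing import Iterable, Tuple


def score(results: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Count exact matches between expected text and OCR output.

    Prints each mismatch as it is found and returns (passes, fails).
    """
    passes = fails = 0
    for expected, actual in results:
        # Strip surrounding whitespace: OCR output usually ends with a newline.
        if actual.strip() == expected:
            passes += 1
        else:
            fails += 1
            print(f"{expected} -> {actual.strip()}")
    return passes, fails


# Sample pairs taken from this thread: three failures and one correct read.
sample = [
    ("21,485,000", "21,483,000"),
    ("24,995", "24,999"),
    ("851", "8571"),
    ("638,997", "638,997\n"),
]
passes, fails = score(sample)
print(f"Pass: {passes}")
print(f"Fail: {fails}")
```

Exact string comparison is deliberate here: for numbers that must be perfect, a single wrong digit counts as a failure.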
> *What ultimately did work*
>
>    - In an act of desperation, and following a bit of a hunch, I 
>    abandoned trying to train/re-train/fine-tune, and just focused on getting 
>    perfect OCR on one of the images where it failed using "eng" model
>       - I chose the file 1,000,000.png, which produced an empty string 
>       when run through Tesseract
>       - I opened 1,000,000.png in GIMP on Windows and began 
>       adjusting/tweaking/filtering the image in various ways, each time 
>       re-trying the OCR to see if the result changed. Using GIMP was crucial 
>       because it let me iterate through different image processing 
>       techniques in a GUI, which was much quicker than doing the same thing 
>       in Python using OpenCV.
>    - Once I found what worked, I implemented it in Python. The magic 
>    steps ended up being:
>       1. Read the source image as color:
>       image_to_ocr = cv2.imread(raw_image_file_name, cv2.IMREAD_COLOR)
>       2. Use only the green channel of the source image. The numbers in 
>       my source images are mostly green tinted and I thought maybe this would 
>       help. This results in a grayscale image with a dark background and 
>       white text:
>       b, image_to_ocr, r = cv2.split(image_to_ocr)
>       3. Enlarge the image by 2x. This resulted in text that is ~20px in 
>       height, and I found this to be both necessary and sufficient. I also 
>       found the use of cv2.INTER_CUBIC instead of cv2.INTER_AREA to be 
>       crucial here. I think the resizing (enlarging in my case) of the 
>       images was an absolute must-have. I'm really thankful I posted here 
>       and really thankful to @shree for that little nugget of insight.
>       image_to_ocr = cv2.resize(image_to_ocr, (image_to_ocr.shape[1] * 2, 
>       image_to_ocr.shape[0] * 2), interpolation = cv2.INTER_CUBIC)
>       4. Invert the image so that the background is white and the text is 
>       black. I am not sure if this step was necessary.
>       image_to_ocr = cv2.bitwise_not(image_to_ocr)
>       - With these steps, 1,000,000.png OCR'd perfectly
>    - I then re-ran my script to check accuracy on all 400+ source images, 
>    and got a perfect result. I was so nervous while the script was running; 
>    it prints out errors as it goes, and so many times before I'd run the 
>    script with eager anticipation that I'd finally gotten everything right, 
>    only to have an error appear. This time...it ran...seconds go by...more 
>    seconds go by...no errors...I can't look OMG...check back in 30 seconds, 
>    446 pass, 0 fail, I literally stood up and whooped and hollered with arms 
>    raised.
>    
>
> On Sunday, September 20, 2020 at 11:09:02 AM UTC-5 shree wrote:
>
>> Resize your images so that text is 36 pixels high. That's what is used 
>> for eng models.
>>
>> Since you are fine tuning, limit number of iterations to 400 or so (not 
>> 10000 which is default).
>>
>> Use debug_level of -1 during training so that you can see the details per 
>> iteration.
>>
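One way to apply the iteration limit is to pass it to tesstrain's `make training` as a variable. The sketch below only builds the argument list. It assumes the tesstrain Makefile variables `START_MODEL`, `TESSDATA` and `MAX_ITERATIONS` named elsewhere in this thread, plus `MODEL_NAME`, which is not mentioned in the thread; the helper name and the tessdata path are placeholders.

```python
from typing import List


def tesstrain_command(
    model_name: str,
    start_model: str,
    tessdata: str,
    max_iterations: int = 400,
) -> List[str]:
    """Build the argument list for a tesstrain `make training` run."""
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]


# "swtor" is the model name used in this thread; the path is a placeholder.
cmd = tesstrain_command("swtor", "eng", "/path/to/tessdata")
print(" ".join(cmd))
# To launch it from the tesstrain checkout: subprocess.run(cmd, check=True)
```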
>>
>>
>> On Sun, Sep 20, 2020, 00:24 Grad <kes...@gmail.com> wrote:
>>
>>> I have fixed my ground-truth file creator script to eliminate the 
>>> badly-formed numbers and have re-run my experiment. Unfortunately, I am 
>>> still seeing really poor results (12 pass, 383 fail), even though the 
>>> training error rates appear to be much smaller this time around:
>>>
>>> At iteration 509/10000/10000, Mean rms=0.184%, delta=0.055%, char 
>>> train=0.344%, word train=2.5%, skip ratio=0%,  New worst char error = 0.344 
>>> wrote checkpoint.
>>>
>>> Finished! Error rate = 0.308
>>> lstmtraining \
>>> --stop_training \
>>> --continue_from data/swtor/checkpoints/swtor_checkpoint \
>>> --traineddata data/swtor/swtor.traineddata \
>>> --model_output data/swtor.traineddata
>>> Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
>>>
>>> Full log of "make training" is attached.
>>>
>>> When I run Tesseract using the "eng" and "swtor" models on the training 
>>> images, I'm seeing the following types of results:
>>>
>>> "eng" model results for 638,997.png:
>>>
>>> > tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>> > cat .\out.txt
>>> 638,997
>>>
>>> "swtor" model results for 638,997.png:
>>>
>>> > tesseract --tessdata-dir -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>>> Failed to load any lstm-specific dictionaries for lang swtor!!
>>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>> > cat .\out.txt
>>> 3,9,997
>>>
>>> In general, digits are more erroneous, and there is a proliferation of 
>>> commas.
>>>
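The two command lines above differ only in the model and the tessdata directory, so the comparison can be scripted. This is a sketch, assuming the `tesseract` executable is on the PATH; the helper names are illustrative, and `"data"` as the tessdata directory is an assumption (the directory is not shown in the commands above).

```python
import subprocess
from typing import List, Optional


def tesseract_args(
    image: str,
    lang: str = "eng",
    tessdata_dir: Optional[str] = None,
    whitelist: str = ",0123456789",
) -> List[str]:
    """Build a single-line (--psm 7), LSTM-only (--oem 1) Tesseract command."""
    # "stdout" as the output base makes Tesseract print the text instead of
    # writing out.txt.
    args = ["tesseract", image, "stdout"]
    if tessdata_dir is not None:
        args += ["--tessdata-dir", tessdata_dir]
    args += ["-l", lang, "--psm", "7", "--oem", "1"]
    args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args


def ocr_line(image: str, **kwargs) -> str:
    """Run Tesseract on one line image and return the recognized text."""
    result = subprocess.run(
        tesseract_args(image, **kwargs), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Only builds and prints the command; ocr_line() would actually run it.
print(tesseract_args("638,997.png", lang="swtor", tessdata_dir="data"))
```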
>>> Do any other ideas come to mind? I appreciate your help Shree!
>>>
>>> On Saturday, September 19, 2020 at 12:12:19 PM UTC-5 Grad wrote:
>>>
>>>> If it turns out to be that simple, I will feel really relieved and 
>>>> really stupid at the same time. I cannot believe I didn't catch this 
>>>> before posting. Thank you for taking a look, I'll fix my ground-truth 
>>>> file creator script and try again.
>>>>
>>>> On Saturday, September 19, 2020 at 12:01:50 PM UTC-5 shree wrote:
>>>>
>>>>> You will get better results when you fix your training data (I deleted 
>>>>> all file names ending in -2 and -3).
>>>>>
>>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.01%), skip ratio=0%
>>>>> Iteration 396: GROUND  TRUTH : 5,500,000
>>>>> File data/swtor-ground-truth/5,500,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.008%), skip ratio=0%
>>>>> Iteration 397: GROUND  TRUTH : 2,000,000
>>>>> File data/swtor-ground-truth/2,000,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.005%), skip ratio=0%
>>>>> Iteration 398: GROUND  TRUTH : 6,435
>>>>> File data/swtor-ground-truth/6,435.lstmf line 0 (Perfect):
>>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.003%), skip ratio=0%
>>>>> Iteration 399: GROUND  TRUTH : 3,750,000
>>>>> File data/swtor-ground-truth/3,750,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(1%), skip ratio=0%
>>>>> 2 Percent improvement time=4, best error was 100 @ 0
>>>>> At iteration 4/400/400, Mean rms=0.144%, delta=0.045%, char 
>>>>> train=0.212%, word train=1%, skip ratio=0%,  New best char error = 0.212 
>>>>> wrote best model:data/swtor/checkpoints/swtor_0.212_4_400.checkpoint 
>>>>> wrote checkpoint.
>>>>>
>>>>> Iteration 400: GROUND  TRUTH : 5,222,100
>>>>> File data/swtor-ground-truth/5,222,100.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(0.998%), skip ratio=0%
>>>>> Iteration 401: GROUND  TRUTH : 696,969
>>>>> File data/swtor-ground-truth/696,969.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.995%), skip ratio=0%
>>>>> Iteration 402: GROUND  TRUTH : 71,000,000
>>>>> File data/swtor-ground-truth/71,000,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.993%), skip ratio=0%
>>>>> Iteration 403: GROUND  TRUTH : 64,500
>>>>> File data/swtor-ground-truth/64,500.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.99%), skip ratio=0%
>>>>> Iteration 404: GROUND  TRUTH : 39,500,000
>>>>> File data/swtor-ground-truth/39,500,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.988%), skip ratio=0%
>>>>> Iteration 405: GROUND  TRUTH : 4,500,000
>>>>> File data/swtor-ground-truth/4,500,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.143%, delta=0.045%, train=0.209%(0.985%), skip ratio=0%
>>>>> Iteration 406: GROUND  TRUTH : 1,450,000
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar <shree...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> > Each of my PNG files has a file name that indicates ground truth, 
>>>>>> and I have a little script that generates ground-truth TXT files from 
>>>>>> the PNG file names.
>>>>>>
>>>>>> Please review your script. I notice a number of file names ending 
>>>>>> with -2. The gt.txt files for the same also contain -2, while the 
>>>>>> image only has the number.
>>>>>>
>>>>>> Example files attached.
>>>>>>
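A ground-truth generator that avoids this problem could strip the suffix before writing the text. This is a minimal sketch, assuming a trailing "-2", "-3" and so on only distinguishes duplicate images of the same number, and that each image gets a matching `.gt.txt` file next to it. The function names are illustrative, not the original script.

```python
import re
from pathlib import Path

# Assumption: a trailing "-<digits>" marks a duplicate image and is not part
# of the text shown in the image.
_DUPLICATE_SUFFIX = re.compile(r"-\d+$")


def ground_truth_for(image_path: str) -> str:
    """Derive the ground-truth text from an image file name."""
    stem = Path(image_path).stem  # "5,500,000-2.png" -> "5,500,000-2"
    return _DUPLICATE_SUFFIX.sub("", stem)


def write_gt_file(image_path: str) -> Path:
    """Write <stem>.gt.txt next to the image and return its path."""
    image = Path(image_path)
    gt_path = image.with_name(image.stem + ".gt.txt")
    gt_path.write_text(ground_truth_for(image_path) + "\n", encoding="utf-8")
    return gt_path


print(ground_truth_for("5,500,000-2.png"))
print(ground_truth_for("5,500,000.png"))
```

The file name of the `.gt.txt` keeps the suffix so it still pairs with its image; only the text inside drops it.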
>>>>>>
>>>>>>
>>>>>
>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f20fef2a-367c-4b10-b1b5-f8349679b4edn%40googlegroups.com.
