[tesseract-ocr] Re: Guide me on training or better/practical pre-processing?

John Roxton Mon, 17 Jun 2024 18:13:14 -0700

Update:
After searching the threads/discussions and reading posts, I decided to 
try the 'ocrd-testset' example that comes with `tesstrain`. Following a 
recommendation @zednop made to another user, I ran the command `make training 
MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best 
MAX_ITERATIONS=10000` and saw a significant improvement, which I verified 
by comparing the results against the default model.


Inspired, I tried training my own model (again) using the "Droid Sans" font 
with random ground-truth text generated from a limited character set 
("A-Za-z0-9._"), with variable lengths of 5-12 characters, starting from 
the tessdata_best eng.traineddata.  For the first ~35,000 iterations, 
training showed signs of improvement, with the BCER decreasing to about 
92%.  Then I noticed the BCER beginning to rise, so I stopped the training. 
Soon after, I resumed it, hoping the rise was only temporary, but the BCER 
kept rising, all the way back to 99.99%, at which point I stopped it and 
haven't restarted it since.
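
For reference, here is a minimal sketch of how random ground-truth text of 
this kind can be generated.  It is not the exact script I used: the function 
names, the `droid_NNNNN` file naming and the fixed seed are assumptions made 
for illustration.  It only writes the `.gt.txt` files; rendering the matching 
`.tif` images in "Droid Sans" is a separate step.

```python
import random
import string
from pathlib import Path

# Allowed characters: A-Z, a-z, 0-9, '.' and '_'
CHARSET = string.ascii_letters + string.digits + "._"


def random_username(rng, min_len=5, max_len=12):
    """Return one random string of min_len..max_len allowed characters."""
    length = rng.randint(min_len, max_len)  # both bounds are inclusive
    return "".join(rng.choice(CHARSET) for _ in range(length))


def write_ground_truth(out_dir, count, seed=0):
    """Write `count` single-line .gt.txt files and return their paths.

    The base name (hypothetical here) must match the image that is
    rendered for the same line, e.g. droid_00000.tif.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = out / f"droid_{i:05d}.gt.txt"
        path.write_text(random_username(rng), encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_ground_truth("data/droid-ground-truth", 10000)` would produce 
10,000 text files; the directory name is only an example.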

The AIs tell me it's likely due to "over-fitting".  This is something I 
don't quite understand, yet.  I am wondering if the arbitrary nature of the 
text in the test set might be "short-circuiting" the prediction, and if 
maybe I should disable the dictionary.
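
Whatever the cause turns out to be, the practical question is which point 
of the run was the best one.  Below is a rough sketch of how that can be 
checked from (iteration, BCER) readings copied out of the training log.  The 
function names and the `patience` parameter are my own assumptions, and the 
numbers in the usage note are made up to mirror the shape of the curve I 
described, not actual log output.

```python
def best_iteration(history):
    """Return the (iteration, bcer) pair with the lowest BCER.

    `history` is a list of (iteration, bcer_percent) tuples in the
    order they were reported.
    """
    return min(history, key=lambda pair: pair[1])


def has_diverged(history, patience=3):
    """Return True if the last `patience` readings are all worse than
    the best BCER seen before them, i.e. training stopped improving."""
    if len(history) <= patience:
        return False  # not enough readings to judge
    best_before = min(bcer for _, bcer in history[:-patience])
    return all(bcer > best_before for _, bcer in history[-patience:])
```

With illustrative readings such as `[(10000, 97.0), (20000, 94.5), 
(35000, 92.0), (40000, 95.0), (45000, 98.0), (50000, 99.99)]`, 
`best_iteration` picks the reading at 35,000 iterations and `has_diverged` 
returns True, which would be the cue to stop and fall back to the model 
saved around that point.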

Any suggestions?

On Monday, June 17, 2024 at 12:39:23 PM UTC-4 John Roxton wrote:

> I should clarify my issues with training my own model:
> I can generate all the needed data, but I simply cannot find a consistent 
> source that can guide me through the LSTM training process.  So, in case 
> anyone is wondering, I have not yet successfully trained and tried my 
> own model.  I have produced some .traineddata files that are larger than 
> the default eng.traineddata file, but they fail to read even the few 
> images above.  Furthermore, I cannot seem to replicate the training process!
>
> I will also mention that my solutions for post-processing with some sort 
> of fuzzy-matching process can be useful with longer strings, but fail 
> miserably with the shortest of strings, where the impact of a single 
> character being misinterpreted is more significant.
>
> On Monday, June 17, 2024 at 12:16:51 PM UTC-4 John Roxton wrote:
>
>> I'm using Tesseract 5.3.3
>>
>> My use-case is to perform OCR on username strings captured from various 
>> ROIs of screenshots.  These strings are 5-12 characters in length and make 
>> use of a set of allowable characters consisting of:  A-Za-z0-9._
>>
>> In general, it seems that Tesseract already does a pretty good job on my 
>> images, but due to the particular font that seems to be used (I believe it 
>> is "Droid Sans"), it often struggles with particular characters or 
>> character combinations.
>>
>> The most common mistake it makes is confusing O (capital o) with 0 (zero). 
>> Another particularly tricky character is either case of the letter "J", 
>> as the "hook" of this letter in this font hangs below the baseline.  It 
>> may also mistake an "I" (capital i) for an "l" (lowercase L).
>>
>> I've found that `--psm 6` usually works best for my use-case.
>>
>> Reading through the `tesseract-ocr` and `tesstrain` documentation, and 
>> learning from what I can find elsewhere online, it seems:
>> - pre-processing images is recommended over training
>> - fine-tuning should be preferred over training from scratch
>>
>> However, I am having great trouble training my own model.  I have 
>> generated 10,000 `.tif` images of text with string lengths of 5-12 
>> characters, using my restricted character set in random combinations 
>> and the "Droid Sans" font, along with associated "ground truth" files 
>> with matching file names and a `.gt.txt` extension.  Additionally, I 
>> have many "in-the-field" images (such as those seen below) for which I 
>> can provide "ground truth" text.
>>
>>
>> Here are some particularly tricky images I've encountered:
>>
>> "CJR21" - often misinterpreted as "R21", "QR21", or "gR21"
>> [image: CJR21.png]
>>
>> "WPJ777" - Interpreted correctly using `--psm 6`
>> [image: WPJ777.png]
>>
>> "SenorC0le" - A common case of a "0" (zero) misinterpreted as a capital 
>> "O"
>> [image: SeenorC0le.png]
>>
>> "Iamagod" - capital i misinterpreted as a lowercase L
>> [image: Iamagod.png]
>>
>> Example of Tesseract's "internal" pre-processing:
>> [image: Olympic-seat_4-25-3503-screenshot.processed.png]
>>
>

