Update:
After searching the threads and discussions and reading posts, I decided to
try the 'ocrd-testset' example that comes with `tesstrain`. Following a
recommendation @zednop made to another user, I ran the command `make training
MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best
MAX_ITERATIONS=10000` and saw a significant improvement, which I verified by
comparing the result against the default model.
Inspired, I tried training my own model (again) using the "Droid Sans" font
with random ground-truth text of 5-12 characters generated from a limited
character set ("A-Za-z0-9._"), starting from the tessdata_best
eng.traineddata. For the first ~35,000 iterations, training showed signs of
improvement, with the BCER decreasing to about 92%. Then I noticed the BCER
beginning to rise, so I stopped the training. Soon after, I resumed it,
hoping the rise was normal, but the BCER kept climbing all the way back to
99.99%, at which point I stopped it and haven't restarted it since.
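
For anyone trying to reproduce the data side of this, here is a minimal
sketch of how random ground-truth text like the above could be generated. It
only illustrates the idea and is not the exact script used: the file naming
(`sample_00000.gt.txt`), the fixed seed, and the output directory are
assumptions, and each `.gt.txt` file still needs a matching line image with
the same base name.

```python
import random
import string
from pathlib import Path

# Character set and length range taken from the post: A-Za-z0-9._ and 5-12 chars.
CHARSET = string.ascii_letters + string.digits + "._"
MIN_LEN, MAX_LEN = 5, 12


def random_username(rng: random.Random) -> str:
    """Return one random string of 5-12 characters drawn from CHARSET."""
    length = rng.randint(MIN_LEN, MAX_LEN)
    return "".join(rng.choice(CHARSET) for _ in range(length))


def write_ground_truth(out_dir: Path, count: int, seed: int = 0) -> list[Path]:
    """Write `count` single-line .gt.txt files and return their paths.

    The naming scheme (sample_00000.gt.txt) is a hypothetical choice.
    """
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = out_dir / f"sample_{i:05d}.gt.txt"
        path.write_text(random_username(rng) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Example (hypothetical directory name):
# write_ground_truth(Path("data/droid-ground-truth"), 10000)
```

Using a seeded `random.Random` instance rather than the module-level
functions makes it possible to regenerate the same set of strings later,
which may help when a training run needs to be repeated.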
The AIs tell me it's likely due to "over-fitting", which I don't quite
understand yet. I am wondering whether the arbitrary nature of the text in
the test set might be "short-circuiting" the prediction, and whether I
should disable the dictionary.
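
As a rough sketch of what "disabling the dictionary" could look like at
recognition time: `load_system_dawg` and `load_freq_dawg` are Tesseract
config variables that control loading of the word lists, and
`tessedit_char_whitelist` restricts the output characters. The helper name
`tesseract_args` is made up for illustration, and adding the whitelist is an
assumption on top of the question. This only affects recognition; whether it
has any bearing on the rising BCER during training is not something the
sketch addresses.

```python
import string


def tesseract_args(image_path: str, lang: str = "eng", psm: int = 6) -> list[str]:
    """Build a tesseract command line that prints to stdout with word lists off."""
    # Allowed characters from the post: A-Za-z0-9._
    whitelist = string.ascii_letters + string.digits + "._"
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--psm", str(psm),
        "-c", "load_system_dawg=0",   # do not load the main dictionary
        "-c", "load_freq_dawg=0",     # do not load the frequent-words list
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


# Example, assuming tesseract is on PATH and the image exists:
# import subprocess
# result = subprocess.run(tesseract_args("CJR21.png"), capture_output=True, text=True)
# print(result.stdout.strip())
```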
Any suggestions?
On Monday, June 17, 2024 at 12:39:23 PM UTC-4 John Roxton wrote:
> I should clarify my issues with training my own model:
> I can generate all the needed data, but I simply cannot find a consistent
> source that can guide me through the LSTM training process. So, in case
> anyone is wondering, I have not yet successfully trained and tested
> my own model. I have produced some .traineddata files that are larger than
> the default eng.traineddata file, but they fail to solve even the few images
> above. Furthermore, I cannot seem to replicate the training process!
>
> I will also mention that my solutions for post-processing with some sort
> of fuzzy-matching process can be useful with longer strings, but fail
> miserably with the shortest of strings, where the impact of a single
> character being misinterpreted is more significant.
>
> On Monday, June 17, 2024 at 12:16:51 PM UTC-4 John Roxton wrote:
>
>> I'm using Tesseract 5.3.3
>>
>> My use-case is to perform OCR on username strings captured from various
>> ROIs of screenshots. These strings are 5-12 characters in length and make
>> use of a set of allowable characters consisting of: A-Za-z0-9._
>>
>> In general, it seems that Tesseract already does a pretty good job on my
>> images, but due to the particular font that seems to be used (I believe it
>> is "Droid Sans"), it often struggles with particular characters or
>> character combinations.
>>
>> The most common mistake is confusing O (capital o) with 0 (zero).
>> Another particularly tricky case is the letter "J" in either case, because
>> the "hook" of this letter in this font hangs below the baseline. It may
>> also mistake an "I" (capital i) for an "l" (lowercase L).
>>
>> I've found that `--psm 6` usually works best for my use-case.
>>
>> Reading through the `tesseract-ocr` and `tesstrain` documentation, and
>> learning from what I can find elsewhere online, it seems:
>> - pre-processing images is recommended over training
>> - fine-tuning should be preferred over training from scratch
>>
>> However, I am having great trouble training my own model. I have
>> generated 10,000 `.tif` images of text of assorted string lengths from
>> 5-12 characters utilizing my restricted character set in random
>> combinations using the "Droid Sans" font, along with associated "ground
>> truth" files with matching file names and a `.gt.txt` extension.
>> Additionally, I have many "in-the-field" images (such as those seen below)
>> that I can provide "ground truth" text for.
>>
>>
>> Here are some particularly tricky images I've encountered:
>>
>> "CJR21" - often misinterpreted as "R21", "QR21", or "gR21"
>> [image: CJR21.png]
>>
>> "WPJ777" - Interpreted correctly using `--psm 6`
>> [image: WPJ777.png]
>>
>> "SenorC0le" - A common case of a "0" (zero) misinterpreted as a capital
>> "O"
>> [image: SeenorC0le.png]
>>
>> "Iamagod" - capital i misinterpreted as a lowercase L
>> [image: Iamagod.png]
>>
>> Example of Tesseract's "internal" pre-processing:
>> [image: Olympic-seat_4-25-3503-screenshot.processed.png]
>>
>
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/02210c67-07d5-48a7-b309-ad3e15148b15n%40googlegroups.com.