[tesseract-ocr] Training lstm with symbol boxes

Maxim Kizub Thu, 17 Jul 2025 08:57:41 -0700

Hello.
I need to OCR text with a mix of Latin and Cyrillic letters plus emoji-like
icons.
The text is printed, not handwritten. I take line images as screenshots.
The stock "rus+eng" model performs badly, probably because of the mix of
scripts and because many words are not in the dictionary. And because of the
icons, of course.

After some attempts to fine-tune 'rus.traineddata' I gave up and decided to
train a new 'language' from scratch. I removed all Cyrillic glyphs that look
similar to Latin letters (like O, H, T, etc.; I just replaced them in the
ground-truth text), added the icons, and trained the new language on about
3000 short lines. But performance became even worse. I cannot provide more
samples, so I decided to improve LSTM training by adding exact boxes for the
glyphs. After I marked the boxes, the performance of the trained model
rose dramatically, and it is completely acceptable now.
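
The glyph replacement step described above can be sketched roughly as follows.
This is only an illustration: the mapping table and the function name
normalize_homoglyphs are hypothetical, and the set of look-alike letters would
have to be adjusted to the actual font and alphabet.

```python
# Hypothetical table of Cyrillic letters that look like Latin ones.
# Keys are written as escapes so the two scripts cannot be confused here.
CYRILLIC_TO_LATIN = {
    "\u0410": "A",  # Cyrillic capital A
    "\u0412": "B",  # Cyrillic capital Ve
    "\u0415": "E",  # Cyrillic capital Ie
    "\u041a": "K",  # Cyrillic capital Ka
    "\u041c": "M",  # Cyrillic capital Em
    "\u041d": "H",  # Cyrillic capital En
    "\u041e": "O",  # Cyrillic capital O
    "\u0420": "P",  # Cyrillic capital Er
    "\u0421": "C",  # Cyrillic capital Es
    "\u0422": "T",  # Cyrillic capital Te
    "\u0425": "X",  # Cyrillic capital Ha
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0443": "y",  # Cyrillic small u
    "\u0445": "x",  # Cyrillic small ha
}

_TABLE = str.maketrans(CYRILLIC_TO_LATIN)


def normalize_homoglyphs(text: str) -> str:
    """Replace look-alike Cyrillic letters with their Latin twins."""
    return text.translate(_TABLE)
```

The same function would need to be applied to the recognized text (or to any
reference text used for comparison), otherwise look-alike letters are counted
as errors.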

*BUT*. The LSTM trained with glyph boxes stops producing spaces in the
recognized text. It reports something like "HelloWorld" instead of
"Hello World", even if there is a huge gap between the words. OK, I revised
the box files and added boxes for spaces. It did not help; Tesseract still
does not recognize spaces between words. I then duplicated the training data,
so it has both symbol boxes (with spaces) and line boxes (one box for the
whole line, as the LSTM tooling generates boxes originally). Now the Tesseract
trainer complains about every sample and reports a huge character error rate,
probably because of the spaces (the glyphs are detected correctly).
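
One way to check that the space boxes really made it into the box files is to
rebuild the text from each box file and compare it with the ground truth. The
sketch below assumes a six-field box layout (symbol, left, bottom, right, top,
page) in which a space is its own entry and a tab symbol marks the end of a
line; that layout is an assumption and is not confirmed in this thread. The
function names are hypothetical.

```python
def text_from_box(box_content: str) -> str:
    """Rebuild the text encoded in a box file (assumed six-field layout)."""
    chars = []
    for line in box_content.splitlines():
        # Split from the right so a symbol that is itself a space survives.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        symbol = parts[0]
        # Assumption: a tab symbol marks the end of a text line.
        chars.append("\n" if symbol == "\t" else symbol)
    return "".join(chars)


def missing_spaces(ground_truth: str, box_content: str) -> int:
    """Number of spaces present in the ground truth but absent in the boxes."""
    return ground_truth.count(" ") - text_from_box(box_content).count(" ")
```

A non-zero result from missing_spaces for a sample would mean the box file and
the ground-truth text disagree about spaces before training even starts.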

So, how can I train the LSTM with glyph boxes so that it recognizes spaces
between words? I cannot use line boxes because of the poor recognition
performance, and I cannot use the new traineddata because it drops spaces; it
seems to do something wrong internally, as if it overfitted on the decision of
whether or not to add a space.
