Hello.
I need to OCR text that mixes Latin and Cyrillic letters plus emoji-like 
icons.
The text is printed, not handwritten. I capture line images as screenshots.
The stock "rus+eng" model performs badly, probably because of the mix of 
scripts and because many words are not in the dictionary. And because of 
the icons, of course.

After some attempts to fine-tune 'rus.traineddata' I gave up and decided to 
train a new 'language' from scratch. I removed all Cyrillic glyphs that look 
similar to Latin letters (O, H, T, etc.; I just replaced them in the 
ground-truth text), added the icons, and trained the new language on about 
3000 short lines. But performance became even worse. I cannot provide more 
samples, so I decided to improve LSTM training by adding exact boxes for 
the glyphs. After I marked the boxes, the accuracy of the trained model 
rose dramatically, and it is completely acceptable now.
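
For reference, the look-alike replacement in the ground-truth text can be sketched roughly like this. It is a minimal illustration, not my exact script: the mapping table and the function name normalize_homoglyphs are assumptions made up for this example, and the real set of glyphs to merge depends on the font.

```python
# Hypothetical mapping of Cyrillic letters to visually identical Latin ones.
# Code points are written as escapes so the two scripts cannot be confused.
CYRILLIC_TO_LATIN = {
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0443": "y", "\u0445": "x",
}

_TABLE = str.maketrans(CYRILLIC_TO_LATIN)


def normalize_homoglyphs(text: str) -> str:
    """Replace Cyrillic look-alikes with Latin letters; leave the rest as is."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # Cyrillic "НЕТ" becomes Latin "HET"; the letter "Ж" has no look-alike.
    print(normalize_homoglyphs("\u041d\u0415\u0422 \u0416"))
```

The same function has to be applied to the recognized output's expected text at evaluation time, otherwise a correct glyph can be counted as an error only because it is stored under the other script.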

*BUT*: the LSTM trained with glyph boxes has stopped producing spaces in the 
recognized text. It reports something like "HelloWorld" instead of 
"Hello World", even if there is a huge gap between the words. OK, I revised 
the box files and added boxes for spaces. It did not help; Tesseract still 
does not recognize spaces between words. I then duplicated the training 
data, so it has both symbol boxes (with spaces) and line boxes (one box for 
the whole line, as the box files are originally generated for LSTM 
training). Now the Tesseract trainer complains about every sample and 
reports a huge character error rate, probably because of spaces (the glyphs 
are detected correctly).
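
One way to check the guess that spaces cause the error rate is to compare the ground truth with the recognized text twice, once as is and once with spaces stripped. The sketch below is only an illustration: it uses a plain Levenshtein distance, which is an assumption and not necessarily how the Tesseract trainer computes its own figure, and the function names are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(truth: str, recognized: str) -> float:
    """Character error rate relative to the length of the ground truth."""
    if not truth:
        return 0.0 if not recognized else 1.0
    return edit_distance(truth, recognized) / len(truth)


def cer_without_spaces(truth: str, recognized: str) -> float:
    """Same as cer(), but with all spaces removed from both strings."""
    return cer(truth.replace(" ", ""), recognized.replace(" ", ""))


if __name__ == "__main__":
    truth, recognized = "Hello World", "HelloWorld"
    print(cer(truth, recognized))
    print(cer_without_spaces(truth, recognized))
```

If the second number is near zero on every sample while the first is not, the remaining errors really are only the missing spaces.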

So, how can I train the LSTM with glyph boxes so that it recognizes spaces 
between words? I cannot use line boxes because of the poor recognition 
accuracy, and I cannot use the new traineddata because it drops spaces; it 
seems to be doing something wrong internally, perhaps overfitting on the 
decision of whether or not to insert a space.




