[tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

O CR Wed, 08 Apr 2020 08:09:55 -0700

Hi all,

I am trying to read names in images with the Tesseract LSTM engine. Names like:


Śerena Kovitch

ŁAGUNA EVREIST

Äna Optici

Orğu Moninck


(I don't need to recognize dictionary words.)


Latin.traineddata (fast integer) handles the diacritics well, but it also 
covers a lot of characters I don't need, such as digits, %, ﹕, ﹖, ﹗, ﹙, ﹚, 
﹛, ﹜, ﹝, ﹞, ﹟, ﹠, ﹡, ﹢, ﹣, ﹤, ﹥, ﹦, ﹨, ﹩, ﹪, ﹫, and many more. As a 
result, Latin.traineddata is too slow for me.
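
To make "the characters I actually need" concrete, here is a minimal Python sketch that derives a letters-only character set from sample names. The function name `needed_characters` and the rule "keep letters and spaces, drop everything else" are my own assumptions for illustration; this is not a Tesseract tool, and it says nothing about how a traineddata file would be rebuilt from the result.

```python
import unicodedata

def needed_characters(names):
    """Return the sorted set of letters (plus space) that occur in the names."""
    keep = set()
    for name in names:
        # NFC makes sure a letter with a diacritic is one code point where possible.
        for ch in unicodedata.normalize("NFC", name):
            # Unicode categories starting with "L" are letters. Digits, %, and
            # small-form punctuation such as ﹕ or ﹖ fall in other categories.
            if ch == " " or unicodedata.category(ch).startswith("L"):
                keep.add(ch)
    return sorted(keep)

# The sample names from this post.
sample_names = ["Śerena Kovitch", "ŁAGUNA EVREIST", "Äna Optici", "Orğu Moninck"]
charset = needed_characters(sample_names)
print("".join(charset))
```

Run over a larger list of real names, the output would give the set of characters the model has to cover.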

So I thought I would take eng.traineddata (best float for LSTM) and 
fine-tune it for the diacritics. But there are almost 400 diacritics, so I 
don't know whether fine-tuning for that many characters is a good idea.
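
With a character set of this size, one thing that can be checked before training is how often each required character actually appears in the training text. The sketch below is only an illustration: the function name `undercovered` and the threshold `min_count` are hypothetical, and I don't know what minimum count fine-tuning really needs.

```python
from collections import Counter

def undercovered(training_text, required_chars, min_count=10):
    """Map each required character seen fewer than min_count times to its count.

    min_count is an arbitrary illustrative threshold, not a Tesseract value.
    Both inputs are assumed to use the same Unicode normalization (e.g. NFC).
    """
    counts = Counter(training_text)
    return {ch: counts[ch] for ch in required_chars if counts[ch] < min_count}

# Tiny stand-in for a training text, built from the sample names in this post.
text = "Śerena Kovitch\nŁAGUNA EVREIST\nÄna Optici\nOrğu Moninck\n"
print(undercovered(text, "ŚŁÄğ", min_count=2))
```

Any character reported here is a candidate for adding more lines to the training text.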

I tried it anyway, but the quality is very poor.

I trained with eng.training_text (an English text of 72 lines), to which I 
added all the diacritics several times. The character error rate reported 
by lstmeval is around 0.1. In a test with 80 documents (one name per 
document), 30 names were read correctly. The recognition time is similar to 
Latin.traineddata.
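
For reference, the two figures above measure different things: a character error rate counts wrong characters, while my document test counts whole names (30 of 80 is 37.5%). Below is a minimal sketch of both, using the usual edit-distance definition of character error rate. This is an assumption for illustration and may not match exactly how lstmeval computes its number; the OCR outputs in the example are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def char_error_rate(pairs):
    """Total edit distance divided by total ground-truth length."""
    errors = sum(edit_distance(truth, ocr) for truth, ocr in pairs)
    total = sum(len(truth) for truth, _ in pairs)
    return errors / total if total else 0.0

def name_accuracy(pairs):
    """Fraction of names that were read exactly right."""
    return sum(truth == ocr for truth, ocr in pairs) / len(pairs)

# (ground truth, OCR output) pairs; the OCR outputs here are hypothetical.
pairs = [("Äna Optici", "Ana Optici"), ("Orğu Moninck", "Orğu Moninck")]
print(char_error_rate(pairs), name_accuracy(pairs))
```

A single lost diacritic keeps the character error rate low but still makes the whole name count as wrong, which is why a character error rate near 0.1 can go together with few fully correct names.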


What can I do to get a model that is as good as Latin.traineddata on 
diacritics but much faster at recognition?


Thank you.
