[tesseract-ocr] non english user-patterns

Karoly Makonyi Wed, 08 Apr 2020 10:37:21 -0700

Hello,

I need to read fixed-format times and dates from images.
The task should be rather trivial, but tesseract behaves strangely.
I am using the Danish trained model. The date format is dd.mm.yy and the
time format is hh:mm.
Very often the ':' in the time is recognized as '1', but this is not
difficult to correct.
In the date, I see the letters 'U' and 'O' in place of the digit '0' (this
is not very difficult to postprocess either), and the letters 'U' and 'H'
in place of the digits '11'.
This is harder to correct.
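
The kind of postprocessing described above could be sketched roughly as
follows. This is only an illustration under stated assumptions: the
confusion table lists just the misreadings mentioned in this message, the
function names fix_time and fix_date are made up for the example, and a
candidate is accepted only if it matches the fixed format and is a valid
calendar date or clock time.

```python
import itertools
import re
from datetime import datetime

# Assumed confusion table: only the misreadings reported in this message.
CONFUSIONS = {"O": ["0"], "U": ["0", "11"], "H": ["11"]}

DATE_RE = re.compile(r"\d\d\.\d\d\.\d\d")
TIME_RE = re.compile(r"\d\d:\d\d")


def fix_time(raw):
    """Return the text as hh:mm, or None if it cannot be repaired."""
    text = raw.strip()
    # In a 5-character string, a '1' in the middle is assumed to be a misread ':'.
    if len(text) == 5 and text[2] == "1":
        text = text[:2] + ":" + text[3:]
    if not TIME_RE.fullmatch(text):
        return None
    hours, minutes = int(text[:2]), int(text[3:])
    return text if hours < 24 and minutes < 60 else None


def fix_date(raw):
    """Return every valid dd.mm.yy reading of the text (possibly none)."""
    text = raw.strip()
    # Each character maps to its possible replacements, or to itself.
    options = [CONFUSIONS.get(ch, [ch]) for ch in text]
    results = []
    for combo in itertools.product(*options):
        candidate = "".join(combo)
        if not DATE_RE.fullmatch(candidate):
            continue
        try:
            datetime.strptime(candidate, "%d.%m.%y")
        except ValueError:
            continue  # right shape, but not a real calendar date
        if candidate not in results:
            results.append(candidate)
    return results
```

Because 'U' can stand for either '0' or '11', fix_date tries both and lets
the fixed length of dd.mm.yy decide; if more than one reading survives, the
caller has to pick one.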
The English pretrained model works perfectly on the examples I checked
(but I can't use it because our embedded system does not have enough
memory).
I can build a whitelist of characters containing only digits and
separators, but the precision doesn't increase much.


Because the format is fixed, I tried to use patterns: \d\d.\d\d.\d\d for
the date and \d\d:\d\d for the time.
With the English model the pattern file is accepted and is obviously used,
but the accuracy drops (it starts confusing ':' with '1' and inserts spaces
between day, month and year, ...).
With the Danish model I get an error message with the _same_ pattern file
(sorry, I can't quote it because I am on another computer, but it says it
cannot recognize the format of the regexp, or something similar).

How does the pattern file depend on the language?
What other ways are there to improve my results?

I am _not_ using LSTM, just tesseract 4.0.0 on Linux.

Thanks in advance,
Karoly

