[tesseract-ocr] Making custom traineddata

kaminski . robert . it Wed, 05 Sep 2018 02:37:58 -0700

Hi,

(I might butcher English grammar, you have been warned!)

For some time now I have been trying to teach Tesseract to read MRZ
codes. Unfortunately it's not going very well. I'm using the latest version
of Tesseract (4.0), so I'm trying to train it with the LSTM method. I've
managed to pull it off and got some custom traineddata files, but the
results of using them are... let's say slightly unsatisfying. As a matter of
fact, they are not even remotely close to the eng traineddata. I know that
there was an mrz traineddata for the previous version of Tesseract.

I'm out of ideas on how to improve accuracy, so I need your help, guys.

At first I thought I could use images, .txt files containing the text
already read from them, and font data to somehow make box files (basically,
for each image you have a .txt containing everything read from the image).
I was disappointed when I realized that without manual correction of the
boxes Tesseract won't know how to apply them correctly. Of course, I need an
automated method to apply boxes (I can't use a GUI or anything like that).

At the moment I'm only using .txt files, and these are the steps I'm
following (it's also worth mentioning that I'm training from scratch):
- Using the .txt files and a font (OcrB) to create .tiff and box files with
  the text2image tool
- Creating a unicharset from all the box files
- (Optional, but for the sake of completeness) applying unicharset
  properties
- Building the traineddata from the unicharset and langdata, with my custom
  language as a parameter
- Creating .lstmf files by running tesseract on each .tiff with the
  lstm.train config (tesseract some.tiff output lstm.train)
- Creating the list of files to train on
- Running LSTM training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96
  Lrx96 Lfx256 O1c111] and learning rate 20e-4
- At the end, using the last checkpoint to create the traineddata for usage.
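For illustration, the steps above could be sketched as a small Python helper
that only builds the command lines and does not run anything. The tool names
are the Tesseract 4 training tools; the file layout, the language name "mrz",
the font name string, the fonts and langdata directories, the iteration
count, and the function name are assumptions made for this sketch, not
details taken from the post.

```python
from pathlib import Path

# Net spec and learning rate are quoted from the post.
NET_SPEC = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]"
LEARNING_RATE = "20e-4"


def build_training_commands(text_files, work_dir, lang="mrz", font="OcrB",
                            fonts_dir="fonts", langdata_dir="langdata",
                            max_iterations=5000):
    """Return (commands, lstmf_files) for a from-scratch LSTM training run.

    All paths and default values are hypothetical placeholders.
    """
    work = Path(work_dir)
    bases = [work / f"{lang}.{Path(t).stem}.exp0" for t in text_files]
    unicharset = work / f"{lang}.unicharset"
    # combine_lang_model is expected to write <output_dir>/<lang>/<lang>.traineddata
    starter = work / lang / f"{lang}.traineddata"
    model_prefix = work / "output" / lang
    list_file = work / f"{lang}.training_files.txt"
    commands = []

    # 1. Render each text file to <base>.tif and <base>.box.
    for text_file, base in zip(text_files, bases):
        commands.append(["text2image", f"--text={text_file}",
                         f"--outputbase={base}", f"--font={font}",
                         f"--fonts_dir={fonts_dir}"])

    # 2. Collect every character that occurs in the box files.
    commands.append(["unicharset_extractor", "--output_unicharset",
                     str(unicharset), *[f"{b}.box" for b in bases]])

    # 3. (Optional) fill in character properties from langdata.
    commands.append(["set_unicharset_properties", "-U", str(unicharset),
                     "-O", str(unicharset), f"--script_dir={langdata_dir}"])

    # 4. Build the traineddata that lstmtraining starts from.
    commands.append(["combine_lang_model", "--input_unicharset",
                     str(unicharset), "--script_dir", langdata_dir,
                     "--output_dir", str(work), "--lang", lang])

    # 5. Turn each image/box pair into a <base>.lstmf training file.
    for base in bases:
        commands.append(["tesseract", f"{base}.tif", str(base), "lstm.train"])

    # 6. The training list is one .lstmf path per line, saved to list_file.
    lstmf_files = [f"{base}.lstmf" for base in bases]

    # 7. Train from scratch with the net spec from the post.
    commands.append(["lstmtraining", "--traineddata", str(starter),
                     "--net_spec", NET_SPEC,
                     "--model_output", str(model_prefix),
                     "--learning_rate", LEARNING_RATE,
                     "--train_listfile", str(list_file),
                     "--max_iterations", str(max_iterations)])

    # 8. Convert the last checkpoint into a usable traineddata file.
    commands.append(["lstmtraining", "--stop_training",
                     "--continue_from", f"{model_prefix}_checkpoint",
                     "--traineddata", str(starter),
                     "--model_output", str(work / f"{lang}.traineddata")])
    return commands, lstmf_files
```

Each returned command is an argument list that could be passed to
subprocess.run once the paths match a real setup; the lstmf_files entries
would be written, one per line, to the list file given to --train_listfile.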

Currently the initial .txt files are randomly generated by my own program in
the form of MRZ codes (samples included). I also tried generating files with
a mixed alphabet to get more character variety. I used about 1000 samples
for training, and the result doesn't differ from using 100 samples.
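As a rough sketch of what such a generator might look like (this is not the
program used for the samples in this post), the Python below produces random
three-line, 30-character MRZ blocks. The field layout and the 7-3-1 weighted
check digits follow the ICAO 9303 TD1 format as I understand it; the fixed
"POL" country code, the random-letter names, and all function names are
assumptions for illustration only.

```python
import random
import string

MRZ_ALPHABET = string.ascii_uppercase + string.digits + "<"


def char_value(ch):
    """MRZ character value: digits as-is, A-Z as 10-35, filler '<' as 0."""
    if ch.isdigit():
        return int(ch)
    if ch == "<":
        return 0
    return ord(ch) - ord("A") + 10


def check_digit(field):
    """Check digit: weights 7, 3, 1 repeating, sum modulo 10."""
    weights = (7, 3, 1)
    total = sum(char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return str(total % 10)


def random_date(rng):
    """Random YYMMDD string (days limited to 1-28 to stay valid)."""
    return f"{rng.randint(0, 99):02d}{rng.randint(1, 12):02d}{rng.randint(1, 28):02d}"


def random_name(rng, low=3, high=10):
    return "".join(rng.choice(string.ascii_uppercase)
                   for _ in range(rng.randint(low, high)))


def random_td1(rng, country="POL"):
    """Return three 30-character lines of a random TD1-style MRZ."""
    doc_number = ("".join(rng.choice(string.ascii_uppercase) for _ in range(3))
                  + "".join(rng.choice(string.digits) for _ in range(6)))
    line1 = ("I" + rng.choice(string.ascii_uppercase) + country
             + doc_number + check_digit(doc_number)).ljust(30, "<")

    birth, expiry = random_date(rng), random_date(rng)
    optional = "".join(rng.choice(string.digits) for _ in range(11))
    body = (birth + check_digit(birth) + rng.choice("MF")
            + expiry + check_digit(expiry) + country + optional)
    # Composite check digit: line 1 positions 6-30, plus line 2 positions
    # 1-7, 9-15 and 19-29 (1-based).
    composite = check_digit(line1[5:30] + body[0:7] + body[8:15] + body[18:29])
    line2 = body + composite

    line3 = (random_name(rng) + "<<" + random_name(rng)).ljust(30, "<")[:30]
    return [line1, line2, line3]


if __name__ == "__main__":
    rng = random.Random(2018)  # fixed seed so a run is reproducible
    for _ in range(3):
        print("\n".join(random_td1(rng)), end="\n\n")
```

Writing each block to its own .txt file would give inputs for the text
rendering step; restricting the output to A-Z, 0-9 and "<" keeps the
generated text inside the MRZ character set.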

Also, I disabled the dictionary in the OCR process to prevent Tesseract from
treating the whole MRZ code as a word.
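One way this could be done on the command line is sketched below; the post
does not say which mechanism was actually used. The sketch assumes the
load_system_dawg and load_freq_dawg config variables passed with -c, and the
language name "mrz" and the function name are hypothetical.

```python
def build_ocr_command(image_path, output_base, lang="mrz"):
    """Build a tesseract command line with the word dictionaries turned off.

    Assumption: the dictionary is disabled through the load_system_dawg and
    load_freq_dawg variables; "mrz" is a placeholder language name.
    """
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "-c", "load_system_dawg=0",  # skip the main word list
        "-c", "load_freq_dawg=0",    # skip the frequent-words list
    ]
```

The returned list could be passed to subprocess.run; whether these variables
have any effect depends on which dictionaries the traineddata contains.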

I might not understand some things despite reading a lot about this topic,
but I'm pretty sure that I'm doing the training process correctly. Do you
have any tips on how to improve the training process? Please point out even
the dumbest things I could have forgotten about.

IZPOLVAW4890675<<<<<<<<<<<<<<<
2805154F5408144POL280515950167
JÓZEF<BARAN<<<<<<<<<<<<<<<<<<<

IOPOLTCV0837027<<<<<<<<<<<<<<<
9111038F7805302POL911103457471
KAZIMIERA<TOMASZEWSKI<<<<<<<<<
