Hi,

(I might butcher English grammar- you have been warned!)

For some time I've been trying to teach tesseract to read MRZ codes. 
Unfortunately it's not going very well. I'm using the latest version of 
tesseract (4.0), so I'm trying to train it with the LSTM method. I've 
managed to pull it off and got some custom traineddata files, but the 
results of using them are... let's say slightly unsatisfying. As a matter 
of fact they are not even remotely close to the eng traineddata. I know 
that there was an mrz traineddata in the previous version of tesseract.

I'm out of ideas on how to improve the accuracy, so I need your help, guys.

At first I thought I could use images, .txt files containing the already 
read data, and font data to somehow make box files (basically you have an 
image and a .txt containing everything read from the image). I was 
disappointed when I realized that without manual correction of the boxes 
tesseract won't know how to apply them correctly. Of course I need an 
automated method to apply the boxes (I can't use any GUI or anything like 
that).

At the moment I'm only using .txt files, and these are the steps I'm doing 
(it's also worth mentioning that I'm trying to train from scratch):
- Using the .txt files and a font (OcrB) to create .tiff and box files with 
  text2image
- Creating a unicharset from all the box files
- (It's optional, but for the sake of it) applying unicharset properties
- Getting traineddata from the unicharset and langdata, with my custom 
  language as a parameter
- Creating the lstmf files with: tesseract some.tiff output lstm.train
- Creating the list of files to train on
- Running LSTM training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 
  Lrx96 Lfx256 O1c111] and learning rate 20e-4
- At the end, using the last checkpoint to create the traineddata for usage

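
To make the steps above concrete, here is a rough Python sketch that only builds the command lines (it does not run them). The tool names and flags follow the Tesseract 4.0 training tools as I understand them and should be checked against each tool's --help output. The language name "mrz", all directory names, and the function name are hypothetical, and the optional unicharset-properties step is left out.

```python
import shlex
from pathlib import Path

# Net spec and learning rate exactly as quoted in the step list.
NET_SPEC = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]"
LEARNING_RATE = 20e-4


def training_commands(lang, text_files, work_dir, langdata_dir, fonts_dir,
                      font="OCRB"):
    """Return the training pipeline as a list of argv lists (nothing is run).

    All paths are placeholders; the font name must match what the local
    fontconfig reports for the OCR-B font (assumption: "OCRB").
    """
    work = Path(work_dir)
    bases = [work / Path(txt).stem for txt in text_files]
    cmds = []

    # 1. Render each text file to an image plus box file.
    for txt, base in zip(text_files, bases):
        cmds.append(["text2image", f"--text={txt}", f"--outputbase={base}",
                     f"--font={font}", f"--fonts_dir={fonts_dir}"])

    # 2. Build one unicharset from all box files.
    unicharset = work / f"{lang}.unicharset"
    cmds.append(["unicharset_extractor", "--output_unicharset", str(unicharset),
                 "--norm_mode", "1", *[f"{b}.box" for b in bases]])

    # 3. Build the starter traineddata for the custom language.
    cmds.append(["combine_lang_model", "--input_unicharset", str(unicharset),
                 "--script_dir", str(langdata_dir),
                 "--output_dir", str(work), "--lang", lang])
    starter = work / lang / f"{lang}.traineddata"

    # 4. Create one .lstmf file per rendered image.
    for base in bases:
        cmds.append(["tesseract", f"{base}.tif", str(base), "lstm.train"])

    # 5. Train from scratch. The list file (one .lstmf path per line) is
    #    assumed to have been written to this location beforehand.
    list_file = work / f"{lang}.training_files.txt"
    model_output = work / "checkpoints" / lang
    cmds.append(["lstmtraining", "--traineddata", str(starter),
                 "--net_spec", NET_SPEC,
                 "--model_output", str(model_output),
                 "--learning_rate", str(LEARNING_RATE),
                 "--train_listfile", str(list_file)])

    # 6. Convert the last checkpoint into a usable traineddata file.
    cmds.append(["lstmtraining", "--stop_training",
                 "--continue_from", f"{model_output}_checkpoint",
                 "--traineddata", str(starter),
                 "--model_output", str(work / f"{lang}.traineddata")])
    return cmds


if __name__ == "__main__":
    for cmd in training_commands("mrz", ["sample1.txt"], "work",
                                 "langdata", "fonts"):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them makes it easy to compare them one by one with what is actually being run by hand.
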
Currently the initial .txt files are randomly generated by my own program 
in the form of MRZ codes (samples included). I also tried generating files 
from a mixed alphabet to get more character variety. I used about 1000 
samples for training, and the result doesn't differ from using 100 samples.
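
For illustration, a minimal sketch of how such random MRZ-style training text could be generated in Python. It assumes the three-line, 30-character layout of the included samples and adds check digits using the 7-3-1 weighting from ICAO Doc 9303; the check digits, the "I<" document code, and all function names are assumptions of this sketch, not a description of the original generator.

```python
import random
import string

WEIGHTS = (7, 3, 1)
LETTERS = string.ascii_uppercase
DIGITS = string.digits


def char_value(ch):
    """Numeric value of one MRZ character: 0-9, A=10..Z=35, filler '<'=0."""
    if ch in DIGITS:
        return int(ch)
    if ch in LETTERS:
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"not an MRZ character: {ch!r}")


def check_digit(field):
    """Weighted sum (7, 3, 1 repeating) modulo 10, returned as one character."""
    total = sum(char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field))
    return str(total % 10)


def random_date(rng):
    """Random YYMMDD string; the day is capped at 28 so any month works."""
    return (f"{rng.randrange(100):02d}"
            f"{rng.randint(1, 12):02d}{rng.randint(1, 28):02d}")


def random_td1(rng, surname, given, country="POL", doc_code="I<"):
    """Return three 30-character lines shaped like an ID-card MRZ."""
    doc_no = ("".join(rng.choice(LETTERS) for _ in range(3))
              + "".join(rng.choice(DIGITS) for _ in range(6)))
    line1 = (doc_code + country + doc_no + check_digit(doc_no)).ljust(30, "<")

    birth, expiry = random_date(rng), random_date(rng)
    line2 = (birth + check_digit(birth) + rng.choice("MF")
             + expiry + check_digit(expiry) + country).ljust(29, "<")
    # Composite check digit over the checked fields of lines 1 and 2
    # (field selection as I understand Doc 9303 for this card size).
    composite = line1[5:30] + line2[0:7] + line2[8:15] + line2[18:29]
    line2 += check_digit(composite)

    # Names may only use A-Z and '<'; accented letters must be transliterated.
    line3 = f"{surname}<<{given}".ljust(30, "<")[:30]
    return [line1, line2, line3]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the output is reproducible
    for line in random_td1(rng, "BARAN", "JOZEF"):
        print(line)
```

Each call yields one three-line sample that can be written to a .txt file and fed to text2image.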

Also, I disabled dictionary in the OCR process to prevent tesseract from 
treating whole MRZ code as a word.
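
One common way to switch the dictionaries off is to set the load_system_dawg and load_freq_dawg config variables to 0 on the command line. The Python sketch below only assembles such a command; whether this matches how the dictionary was disabled here is an assumption, as are the model name "mrz", the page segmentation mode, and the function name.

```python
def ocr_command(image, output_base, lang="mrz", tessdata_dir=None):
    """Build a tesseract command line with the word dictionaries disabled.

    "mrz" is a placeholder for the custom traineddata name, and --psm 6
    (a single uniform block of text) is an assumed setting.
    """
    cmd = ["tesseract", image, output_base, "-l", lang, "--psm", "6"]
    if tessdata_dir:
        # Directory that contains <lang>.traineddata
        cmd += ["--tessdata-dir", tessdata_dir]
    # Do not load the main word list or the frequent-words list.
    cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd


if __name__ == "__main__":
    print(" ".join(ocr_command("card.png", "card_out", tessdata_dir="work")))
```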

I might not understand some things despite reading a lot about this topic, 
but I'm pretty sure that I'm doing the training process correctly. Do you 
have any tips on how to improve the training process? Feel free to point 
out even the dumbest things I could have forgotten.

IZPOLVAW4890675<<<<<<<<<<<<<<<
2805154F5408144POL280515950167
JÓZEF<BARAN<<<<<<<<<<<<<<<<<<<
IOPOLTCV0837027<<<<<<<<<<<<<<<
9111038F7805302POL911103457471
KAZIMIERA<TOMASZEWSKI<<<<<<<<<
