Hi Shree,

We have tried your traineddata file for MRZ and noticed that it does not detect the character X.
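A quick way to spot this kind of gap is to count which characters of the MRZ alphabet (digits, uppercase A-Z and the filler "<") never occur in a training text. The following is only a minimal sketch: the function name and the sample string are made up for illustration, and a real check would read the actual training text file instead.

```python
import string
from collections import Counter

# MRZ alphabet: digits, uppercase Latin letters and the filler character "<".
MRZ_ALPHABET = string.digits + string.ascii_uppercase + "<"


def missing_mrz_chars(text):
    """Return the MRZ characters that never occur in the given text."""
    seen = Counter(text)
    return [ch for ch in MRZ_ALPHABET if seen[ch] == 0]


# Hypothetical sample line, not taken from any real training text.
sample = "P<UTOERIKSSON<<ANNA<MARIA<<<<"
print(missing_mrz_chars(sample))
```

Any character reported as missing cannot be learned from that text alone, so it is worth adding lines that contain it before training.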
Looking at https://github.com/Shreeshrii/tessdata_ocrb/blob/master/eng.MRZ.training_text, we see that there is no X in the training text.

In addition, it might be good to add a couple of lines that are specific to IDs (starting with I). Note that these are all fake:

IDESPANH186495123456789X<<<<<<
IXESPE002561410<0233181G<<<<<
I<NLDIS2KX87214<<<<<<<<<<<<<<<

On Wednesday, 5 September 2018 18:03:41 UTC+2, shree wrote:
>
> See https://github.com/Shreeshrii/tessdata_ocrb
> for the files and traineddata.
>
> On Wed, Sep 5, 2018 at 8:51 PM, Shree Devi Kumar <shree...@gmail.com> wrote:
>
>> I think finetune will be a better option than training from scratch.
>>
>> Using a small training/test text - 40 lines, I get
>>
>> ---------------------------------
>>
>> + lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/script/Latin.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> At iteration 0, stage 0, *Eval Char error rate=0.73106061*, *Word error rate=13.75*
>>
>> ---------------------------------
>>
>> + lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> At iteration 0, stage 0, *Eval Char error rate=47.444889, Word error rate=92.5*
>>
>> ---------------------------------
>>
>> *At iteration 16/410/410, Mean rms=0.236%, delta=0.131%, char train=0.448%, word train=3.659%, skip ratio=0%, New best char error = 0.448 wrote checkpoint.*
>>
>> *Finished! Error rate = 0.448*
>>
>> ---------------------------------
>>
>> + lstmeval --model /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint --traineddata /home/ubuntu/tesstutorial/ocrb/eng/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>> /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint is not a recognition model, trying training checkpoint...
>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>> At iteration 0, stage 0, *Eval Char error rate=0, Word error rate=0*
>>
>> ---------------------------------
>>
>> On Wed, Sep 5, 2018 at 1:55 PM, <kaminski...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> (I might butcher English grammar - you have been warned!)
>>>
>>> For some time I've been trying to teach tesseract to read MRZ codes. Unfortunately it's not going very well. I'm using the latest version of tesseract (4.0), so I'm trying to train it with the LSTM method. I've managed to pull it off and got some custom traineddata samples, but the results of using them are... let's say slightly unsatisfying. As a matter of fact, they are not even remotely close to the eng traineddata. I know that there was an mrz traineddata in the previous version of tesseract.
>>>
>>> I'm out of ideas how to improve accuracy, so I'll need your help guys.
>>>
>>> At first I thought I could use images, .txt files containing already read data, and font data to somehow make box files (basically you have an image and a .txt containing everything read from the image). I was disappointed when I realized that without manual correction of the boxes tesseract won't know how to apply them correctly. Of course I need an automated method to apply boxes (I can't use any GUI or something).
>>>
>>> At the moment I'm only using .txt files and these are the steps I'm doing (it's also good to mention that I'm trying to make it from scratch):
>>> -Using .txt and font (OcrB) to create .tiff and box files using text2image
>>> -Creating unicharset from all box files
>>> -(it's optional but for the sake of it) I'm applying unicharset properties
>>> -Getting traineddata from unicharset, langdata and using custom language as parameter
>>> -Creating lstmf file by tesseract some .tiff output lstm.train
>>> -Creating list of files to train
>>> -Running lstm training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111] and learning rate 20e-4
>>> -At the end I'm using last checkpoint to create traineddata for usage.
>>>
>>> Currently the initial .txt files are randomly generated by my own program in the form of MRZ codes (samples included). I also tried to generate files in the form of a mixed alphabet to get a variety of characters. I was using about 1000 samples to train it and the result doesn't differ from using 100 samples.
>>>
>>> Also, I disabled the dictionary in the OCR process to prevent tesseract from treating the whole MRZ code as a word.
>>>
>>> I might not understand some things despite reading a lot about this topic, but I'm pretty sure that I'm doing the training process correctly. Do you have any tips how to improve the training process? Consider pointing out even the dumbest things I could forget about.
>>>
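The randomly generated training text described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the generator used in this thread: the function names, the default line length of 30 (the length of an ID-card style line) and the "I" prefix are all choices made for the example. The output is only MRZ-like; it follows no real field layout and carries no check digits. The one property it does enforce is that every character of the MRZ alphabet, X included, appears at least once.

```python
import random
import string

# MRZ alphabet: digits, uppercase Latin letters and the filler character "<".
MRZ_ALPHABET = string.digits + string.ascii_uppercase + "<"


def random_mrz_like_line(rng, length=30, prefix="I"):
    """Build one MRZ-like line: prefix, random body, then "<" padding."""
    body_len = rng.randint(length // 2, length - len(prefix))
    body = "".join(rng.choice(MRZ_ALPHABET) for _ in range(body_len))
    return (prefix + body).ljust(length, "<")


def training_lines(count, seed=0, length=30):
    """Generate `count` random lines plus extra lines for uncovered characters."""
    rng = random.Random(seed)
    lines = [random_mrz_like_line(rng, length) for _ in range(count)]
    missing = [ch for ch in MRZ_ALPHABET
               if not any(ch in line for line in lines)]
    # Pack characters that never appeared into additional padded lines.
    step = length - 1
    for start in range(0, len(missing), step):
        chunk = "".join(missing[start:start + step])
        lines.append(("I" + chunk).ljust(length, "<"))
    return lines


if __name__ == "__main__":
    for line in training_lines(5):
        print(line)
```

The fixed seed keeps the text reproducible between runs, which makes it easier to compare training results when only the amount or the mix of lines changes.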
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a8ddadfc-ac03-4169-8de3-68da65910ba6%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.