See https://github.com/Shreeshrii/tessdata_ocrb for the files and traineddata.
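
For anyone who wants to try a similar fine-tuning run, here is a minimal sketch of the usual three steps from the Tesseract 4 training tutorial: extract the LSTM from an existing traineddata, continue training from it, then convert the checkpoint into a traineddata file. It only assembles the command lines and does not run them. The `ocrb` output name, `max_iterations=400` and the paths in the usage note are assumptions for illustration; the commands actually used for the run quoted below are in the repository above and may differ. In particular, the quoted run evaluates against a separately built eng.traineddata, which suggests it may have used the tutorial's `--old_traineddata` variant for a changed character set, and that variant is not shown here.

```python
from pathlib import Path


def finetune_commands(base_traineddata, work_dir, train_listfile,
                      name="ocrb", max_iterations=400):
    """Build (but do not run) the command lines for a basic fine-tuning run.

    `name` and `max_iterations` are illustrative defaults, not values
    taken from the quoted run.
    """
    base = Path(base_traineddata)
    work = Path(work_dir)
    lstm = work / (base.stem + ".lstm")   # e.g. eng.lstm
    prefix = work / name                  # checkpoints become <prefix>_checkpoint
    return [
        # 1. Extract the LSTM model from the existing traineddata.
        ["combine_tessdata", "-e", str(base), str(lstm)],
        # 2. Continue training from that model on the new .lstmf files.
        ["lstmtraining",
         "--model_output", str(prefix),
         "--continue_from", str(lstm),
         "--traineddata", str(base),
         "--train_listfile", str(train_listfile),
         "--max_iterations", str(max_iterations)],
        # 3. Convert the last checkpoint into a usable traineddata file.
        ["lstmtraining", "--stop_training",
         "--continue_from", f"{prefix}_checkpoint",
         "--traineddata", str(base),
         "--model_output", str(work / f"{name}.traineddata")],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the paths exist, for example `finetune_commands("/home/ubuntu/tessdata_best/eng.traineddata", "/home/ubuntu/tesstutorial/ocrb_from_full", "/home/ubuntu/tesstutorial/ocrb/eng.training_files.txt")`.
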
On Wed, Sep 5, 2018 at 8:51 PM, Shree Devi Kumar <shreesh...@gmail.com> wrote:

> I think fine-tuning will be a better option than training from scratch.
>
> Using a small training/test text - 40 lines, I get
>
> ---------------------------------
>
> + lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/script/Latin.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> At iteration 0, stage 0, Eval Char error rate=0.73106061, Word error rate=13.75
>
> ---------------------------------
>
> + lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> At iteration 0, stage 0, Eval Char error rate=47.444889, Word error rate=92.5
>
> ---------------------------------
>
> At iteration 16/410/410, Mean rms=0.236%, delta=0.131%, char train=0.448%, word train=3.659%, skip ratio=0%, New best char error = 0.448 wrote checkpoint.
>
> Finished! Error rate = 0.448
>
> ---------------------------------
>
> + lstmeval --model /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint --traineddata /home/ubuntu/tesstutorial/ocrb/eng/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
> /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint is not a recognition model, trying training checkpoint...
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>
> ---------------------------------
>
> On Wed, Sep 5, 2018 at 1:55 PM, <kaminski.robert...@gmail.com> wrote:
>
>> Hi,
>>
>> (I might butcher English grammar; you have been warned!)
>>
>> For some time I have been trying to teach Tesseract to read MRZ codes.
>> Unfortunately, it's not going very well. I'm using the latest version
>> of Tesseract (4.0), so I'm trying to train it with the LSTM method.
>> I've managed to pull it off and got some custom traineddata samples,
>> but the results of using them are... let's say slightly unsatisfying.
>> As a matter of fact, they are not even remotely close to the eng
>> traineddata. I know that there was an mrz traineddata for the previous
>> version of Tesseract.
>>
>> I'm out of ideas on how to improve accuracy, so I'll need your help,
>> guys.
>>
>> At first I thought I could use images, .txt files containing the
>> already-read data, and font data to somehow make box files (basically
>> you have an image and a .txt containing everything read from the
>> image). I was disappointed when I realized that without manual
>> correction of the boxes Tesseract won't know how to apply them
>> correctly. Of course I need an automated method to apply boxes (I
>> can't use any GUI or anything like that).
>>
>> At the moment I'm only using .txt files, and these are the steps I'm
>> doing (it's also worth mentioning that I'm trying to train from
>> scratch):
>> - Using the .txt files and a font (OcrB) to create .tiff and box files
>>   with text2image
>> - Creating a unicharset from all box files
>> - (Optional, but for the sake of it) applying unicharset properties
>> - Getting traineddata from the unicharset and langdata, using a custom
>>   language as parameter
>> - Creating .lstmf files by running tesseract with the lstm.train config
>>   (tesseract some.tiff output lstm.train)
>> - Creating the list of files to train on
>> - Running LSTM training with net spec
>>   [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]
>>   and learning rate 20e-4
>> - At the end, using the last checkpoint to create the traineddata for
>>   use.
>>
>> Currently the initial .txt files are randomly generated by my own
>> program in the form of MRZ codes (samples included). I also tried
>> generating files with a mixed alphabet to get more character variety.
>> I was using about 1000 samples for training, and the result doesn't
>> differ from using 100 samples.
>>
>> Also, I disabled the dictionary in the OCR process to prevent Tesseract
>> from treating the whole MRZ code as a word.
>>
>> I might not understand some things despite reading a lot about this
>> topic, but I'm pretty sure that I'm doing the training process
>> correctly. Do you have any tips on how to improve it? Consider
>> pointing out even the dumbest things I might have forgotten.
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send
>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
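
The original question mentions generating random MRZ-style text for training. As an illustration only, here is a minimal sketch of such a generator for two-line, 44-character passport-style (TD3) MRZ text, with check digits computed using the standard 7-3-1 weighting. The function names, the random field contents and the field-length choices are assumptions for this sketch; this is not the program the original poster used.

```python
import random
import string

MRZ_ALPHABET = string.ascii_uppercase + string.digits + "<"
WEIGHTS = (7, 3, 1)


def char_value(ch):
    """Digits count as themselves, A-Z as 10-35, the filler '<' as 0."""
    if ch.isdigit():
        return int(ch)
    if ch == "<":
        return 0
    return ord(ch) - ord("A") + 10


def check_digit(field):
    """Weighted sum of character values (weights 7, 3, 1 repeating), mod 10."""
    return sum(char_value(ch) * WEIGHTS[i % 3] for i, ch in enumerate(field)) % 10


def random_td3(rng):
    """Return one random two-line, 44-character MRZ as (line1, line2)."""
    letters = string.ascii_uppercase

    def word(lo, hi):
        return "".join(rng.choice(letters) for _ in range(rng.randint(lo, hi)))

    def date():
        # YYMMDD; days limited to 01-28 so every month is valid.
        return f"{rng.randint(0, 99):02d}{rng.randint(1, 12):02d}{rng.randint(1, 28):02d}"

    state = word(3, 3)
    names = f"{word(3, 12)}<<{word(3, 10)}<{word(3, 10)}"
    line1 = ("P<" + state + names).ljust(44, "<")

    doc_no = "".join(rng.choice(letters + string.digits) for _ in range(9))
    personal = "".join(
        rng.choice(string.digits) for _ in range(rng.randint(0, 14))
    ).ljust(14, "<")

    # Each field is followed by its own check digit.
    part_doc = doc_no + str(check_digit(doc_no))
    birth = date()
    part_birth = birth + str(check_digit(birth))
    expiry = date()
    part_expiry = expiry + str(check_digit(expiry))
    part_personal = personal + str(check_digit(personal))

    # The final digit covers all checked fields of line 2 together.
    composite = check_digit(part_doc + part_birth + part_expiry + part_personal)
    line2 = (part_doc + state + part_birth + rng.choice("MF<")
             + part_expiry + part_personal + str(composite))
    return line1, line2
```

Calling `random_td3(random.Random(0))` repeatedly and writing the two lines of each result to a text file gives input that could be passed to text2image. Seeding the generator keeps the training text reproducible between runs.
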