Re: [tesseract-ocr] Making custom traineddata

Shree Devi Kumar Wed, 05 Sep 2018 09:03:43 -0700

See https://github.com/Shreeshrii/tessdata_ocrb
for the files and traineddata.



On Wed, Sep 5, 2018 at 8:51 PM, Shree Devi Kumar <shreesh...@gmail.com>
wrote:

> I think fine-tuning will be a better option than training from scratch.
>
> Using a small training/test text (40 lines), I get:
>
> ---------------------------------
>
> + lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/script/Latin.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> At iteration 0, stage 0, *Eval Char error rate=0.73106061*, *Word error rate=13.75*
>
> ---------------------------------
>
> + lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> At iteration 0, stage 0, *Eval Char error rate=47.444889, Word error rate=92.5*
>
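The two baseline runs above can be compared mechanically. A minimal sketch, assuming the lstmeval output has been saved or piped as plain text with the summary on one unwrapped line (the helper name eval_rates is made up here, it is not a tesseract tool):

```shell
# Print "<char error> <word error>" for each lstmeval summary line.
# Reads the named files, or stdin when no file is given.
eval_rates() {
  sed -n 's/.*Eval Char error rate=\([0-9.]*\), Word error rate=\([0-9.]*\).*/\1 \2/p' "$@"
}

# Example with the two summary lines quoted above.
eval_rates <<'EOF'
At iteration 0, stage 0, Eval Char error rate=0.73106061, Word error rate=13.75
At iteration 0, stage 0, Eval Char error rate=47.444889, Word error rate=92.5
EOF
# prints:
# 0.73106061 13.75
# 47.444889 92.5
```

When piping directly from lstmeval, adding 2>&1 before the pipe is a safe default in case the summary is written to stderr.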
>
> ---------------------------------
>
> *At iteration 16/410/410, Mean rms=0.236%, delta=0.131%, char train=0.448%, word train=3.659%, skip ratio=0%,  New best char error = 0.448 wrote checkpoint.*
>
> *Finished! Error rate = 0.448*
>
>
> ---------------------------------
>
>
> + lstmeval --model /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint --traineddata /home/ubuntu/tesstutorial/ocrb/eng/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
> /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint is not a recognition model, trying training checkpoint...
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
> At iteration 0, stage 0, *Eval Char error rate=0, Word error rate=0*
>
> ---------------------------------
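The logs above show only the evaluation steps, not the training command itself. As a rough sketch of what a fine-tuning run of this kind usually looks like with the Tesseract 4 tools: the directory paths are taken from the logs, but the starting model (eng rather than script/Latin), the eng.lstm file name, the iteration limit and the final output name are assumptions. The run helper is hypothetical and only prints the commands unless RUN=1 is set.

```shell
BASE=/home/ubuntu/tessdata_best
DATA=/home/ubuntu/tesstutorial/ocrb
OUT=/home/ubuntu/tesstutorial/ocrb_from_full
START=$BASE/eng.traineddata   # assumed; the thread does not say which model was the base
MAX_ITER=400                  # assumed; the log only shows training ending at iteration 410

# Print each command instead of running it unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}

finetune() {
  # 1. Extract the LSTM network from the starting traineddata.
  run combine_tessdata -e "$START" "$OUT/eng.lstm"
  # 2. Continue training that network on the OCR-B line images.
  run lstmtraining \
    --continue_from "$OUT/eng.lstm" \
    --traineddata "$START" \
    --train_listfile "$DATA/eng.training_files.txt" \
    --model_output "$OUT/ocrb_plus" \
    --max_iterations "$MAX_ITER"
  # 3. Convert the resulting checkpoint into a traineddata file.
  run lstmtraining --stop_training \
    --continue_from "$OUT/ocrb_plus_checkpoint" \
    --traineddata "$START" \
    --model_output "$OUT/ocrb.traineddata"
}

finetune
```

The lstmeval call above passes a separate ocrb/eng/eng.traineddata, so the real run may have used a rebuilt traineddata; if the character set changes, lstmtraining also has an --old_traineddata option, which this sketch leaves out.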
>
> On Wed, Sep 5, 2018 at 1:55 PM, <kaminski.robert...@gmail.com> wrote:
>
>> Hi,
>>
>> (I might butcher English grammar; you have been warned!)
>>
>>    For some time I've been trying to teach tesseract to read MRZ
>> codes. Unfortunately it's not going very well. I'm using the latest
>> version of tesseract (4.0), so I'm trying to train it with the LSTM
>> method. I've managed to pull it off and got some custom traineddata
>> samples, but the results of using them are... let's say slightly
>> unsatisfying. As a matter of fact, they are not even remotely close to
>> the eng traineddata. I know that there was an mrz traineddata for the
>> previous version of tesseract.
>>
>> I'm out of ideas on how to improve accuracy, so I need your help, guys.
>>
>> At first I thought I could use images, .txt files containing the
>> already-read data, and font data to somehow make box files (basically
>> you have an image and a .txt containing everything read from the
>> image). I was disappointed when I realized that without manual
>> correction of the boxes tesseract won't know how to apply them
>> correctly. Of course, I need an automated method to apply boxes (I
>> can't use a GUI or anything like that).
>>
>> At the moment I'm only using .txt files, and these are the steps I'm
>> doing (it's also worth mentioning that I'm trying to train from scratch):
>> - Using the .txt and the font (OcrB) to create .tiff and box files with
>> text2image
>> - Creating a unicharset from all box files
>> - (It's optional, but for the sake of it) applying unicharset properties
>> - Getting traineddata from the unicharset and langdata, using a custom
>> language as parameter
>> - Creating lstmf files by running: tesseract some.tiff output lstm.train
>> - Creating the list of files to train on
>> - Running LSTM training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48
>> Lfx96 Lrx96 Lfx256 O1c111] and learning rate 20e-4
>> - At the end, using the last checkpoint to create traineddata for use.
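For reference, the steps listed above map roughly onto the Tesseract 4 training tools as sketched below. Only the net spec and the learning rate come from the post. The language code mrz, all directories, the font name (guessed from the .lstmf file names in the logs) and the iteration limit are placeholders, and the run helper is hypothetical: it prints the commands unless RUN=1 is set, so the printed lines are for inspection rather than copy and paste.

```shell
LNG=mrz                        # hypothetical custom language code
WORK=./mrz_train               # hypothetical working directory
LANGDATA=./langdata            # assumed location of the langdata files
FONTS=/usr/share/fonts         # assumed font directory
FONT='OCR-B 10 BT'             # assumed font name
TEXT=$WORK/$LNG.training_text  # the generated MRZ text
LIST=$WORK/$LNG.training_files.txt
MAX_ITER=5000                  # arbitrary placeholder
NET_SPEC='[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'

# Print each command instead of running it unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}

prepare() {
  base="$WORK/$LNG.ocrb.exp0"
  # Render the text to an image plus box file.
  run text2image --text="$TEXT" --outputbase="$base" --font="$FONT" --fonts_dir="$FONTS"
  # Build the unicharset and set its properties.
  run unicharset_extractor --output_unicharset "$WORK/unicharset" --norm_mode 1 "$base.box"
  run set_unicharset_properties -U "$WORK/unicharset" -O "$WORK/unicharset" --script_dir="$LANGDATA"
  # Create the starter traineddata in $WORK/$LNG/.
  run combine_lang_model --input_unicharset "$WORK/unicharset" --script_dir "$LANGDATA" --output_dir "$WORK" --lang "$LNG"
  # Produce the .lstmf training file.
  run tesseract "$base.tif" "$base" --psm 6 lstm.train
  # List the .lstmf files (real runs only).
  if [ "${RUN:-0}" = 1 ]; then ls "$WORK"/*.lstmf > "$LIST"; fi
}

train() {
  run lstmtraining --traineddata "$WORK/$LNG/$LNG.traineddata" \
    --net_spec "$NET_SPEC" --learning_rate 20e-4 \
    --model_output "$WORK/$LNG" --train_listfile "$LIST" \
    --max_iterations "$MAX_ITER"
}

prepare
train
```
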
>> Currently the initial .txt files are randomly generated by my own
>> program in the form of MRZ codes (samples included). I also tried
>> generating files with a mixed alphabet to get more character variety. I
>> used about 1000 samples for training, and the result doesn't differ
>> from using 100 samples.
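The mixed-alphabet variant mentioned above could be generated along these lines. This is a sketch under assumptions, not the poster's program: it draws uniformly from the MRZ character set (A-Z, 0-9 and the filler <) and uses 44 characters per line, the length of a passport (TD3) MRZ line. It does not reproduce real field layout or check digits, and the function name mrz_lines is made up.

```shell
# Print N random 44-character lines over the MRZ alphabet.
# $1 = number of lines (default 10), $2 = random seed (default 1).
mrz_lines() {
  awk -v n="${1:-10}" -v seed="${2:-1}" 'BEGIN {
    chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789<"
    srand(seed)
    for (i = 0; i < n; i++) {
      line = ""
      for (j = 0; j < 44; j++)
        line = line substr(chars, int(rand() * length(chars)) + 1, 1)
      print line
    }
  }'
}

# Five sample lines with a fixed seed, for a repeatable training text.
mrz_lines 5 42
```

Real MRZ lines contain long runs of <, so text generated this way has a different character distribution from the MRZ-shaped samples.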
>>
>> Also, I disabled the dictionary in the OCR process to prevent tesseract
>> from treating the whole MRZ code as a word.
>>
>> I might not understand some things despite reading a lot about this
>> topic, but I'm pretty sure that I'm doing the training process
>> correctly. Do you have any tips on how to improve the training process?
>> Consider pointing out even the dumbest things I could have forgotten
>> about.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> --
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>


