Re: [tesseract-ocr] Fine tuning existing model

Shree Devi Kumar Fri, 29 Jun 2018 09:28:14 -0700

You should be able to use the new makefile once you have changed all the
directory locations to match your setup.
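
A minimal sketch for checking the directory locations before running make. The function name check_setup is made up for this example, and the file list is only what the makefile quoted below references through TESSDATA, LANGDATA, MODEL_NAME and CONTINUE_FROM; your langdata layout may differ.

```shell
#!/bin/sh
# Report which files the fine-tuning makefile refers to but cannot be found.
# Arguments: TESSDATA dir, LANGDATA dir, MODEL_NAME, CONTINUE_FROM.
# check_setup is a hypothetical helper, not part of tesseract or ocrd-train.
check_setup() {
    tessdata="$1"; langdata="$2"; model="$3"; base="$4"
    missing=0
    for f in \
        "$tessdata/$base.traineddata" \
        "$langdata/$model/$model.wordlist" \
        "$langdata/$model/$model.numbers" \
        "$langdata/$model/$model.punc"
    do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Non-zero exit status when at least one file is absent.
    return "$missing"
}

# Example call with the default values from the quoted makefile.
check_setup "$HOME/tessdata_best" "$HOME/langdata" frk frk \
    || echo "fix the paths before running make"
```

If it prints nothing, the paths the makefile builds from those four variables all exist.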


Change the language from frk to eng. However, the sample training text seems
to be non-English, in which case it is better to continue from the
appropriate language traineddata, e.g. tessdata_best/deu.traineddata for
German.
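
As a rough illustration of the language switch: GNU make lets variables given on the command line override the assignments inside the makefile, so the file itself need not be edited. The helper name finetune_cmd is made up for this example and only prints the command; the paths are placeholders, and langdata is assumed to contain a directory for the chosen language.

```shell
#!/bin/sh
# Print (do not run) the make invocation for fine-tuning from a base model.
# finetune_cmd is a hypothetical helper; the variable names MODEL_NAME,
# CONTINUE_FROM, TESSDATA and LANGDATA are the ones used in the makefile.
finetune_cmd() {
    lang="$1"        # language code, e.g. deu or eng
    tessdata="$2"    # directory holding <lang>.traineddata from tessdata_best
    langdata="$3"    # directory holding the langdata checkout
    printf 'make training MODEL_NAME=%s CONTINUE_FROM=%s TESSDATA=%s LANGDATA=%s\n' \
        "$lang" "$lang" "$tessdata" "$langdata"
}

# German base model, placeholder paths.
finetune_cmd deu "$HOME/tessdata_best" "$HOME/langdata"
```

Copy the printed line into the shell once the paths look right.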

On Fri, Jun 29, 2018 at 9:03 PM Lorenzo Bolzani <l.bolz...@gmail.com> wrote:

> Hi Shree, thanks for your answer.
>
> I tried the script setting:
>
> TESSDATA=extracted                 # here I have eng.lstm and eng.traineddata
> LANGDATA=langdata-master     # all langdata downloaded by OCR-D
>
> MODEL_NAME = eng
> CONTINUE_FROM = eng
>
>
> First I run the old Makefile to create the boxes.
>
> $ make training MODEL_NAME=eng
>
>
> I stop it as soon as the training starts:
>
> At iteration 400/400/400, Mean rms=6.657%, delta=40.765%, char train=100.827%, word train=100%, skip ratio=0%,  New worst char error = 100.827 wrote checkpoint.
>
>
> At iteration 500/500/500, Mean rms=6.644%, delta=40.423%, char train=100.662%, word train=100%, skip ratio=0%,  New worst char error = 100.662 wrote checkpoint.
>
> ^Cmake: *** Deleting file 'data/checkpoints/eng_checkpoint'
> Makefile:110: recipe for target 'data/checkpoints/eng_checkpoint' failed
> make: *** [data/checkpoints/eng_checkpoint] Interrupt
>
> Note that the data/checkpoints/eng_checkpoint file is deleted; I do not
> know whether that is relevant.
>
>
> Then I switch to the new one and I get this:
>
> $ make training
>
> mkdir -p data/checkpoints
> lstmtraining \
>   --continue_from   extracted/eng.lstm \
>   --old_traineddata extracted/eng.traineddata \
>   --traineddata data/eng/eng.traineddata \
>   --model_output data/checkpoints/eng \
>   --debug_interval -1 \
>   --train_listfile data/list.train \
>   --eval_listfile data/list.eval \
>   --sequential_training \
>   --max_iterations 3000
> Loaded file extracted/eng.lstm, unpacking...
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Code range changed from 111 to 76!
> Num (Extended) outputs,weights in Series:
>   1,36,0,1:1, 0
> Num (Extended) outputs,weights in Series:
>   C3,3:9, 0
>   Ft16:16, 160
> Total weights = 160
>   [C3,3Ft16]:16, 160
>   Mp3,3:16, 0
>   Lfys64:64, 20736
>   Lfx96:96, 61824
>   Lrx96:96, 74112
>   Lfx512:512, 1247232
>   Fc76:76, 0
> Total weights = 1404064
> Previous null char=110 mapped to 75
> Continuing from extracted/eng.lstm
> Loaded 1/1 pages (1-1) of document data/train/mueller_waldhornist_1821_0130_010.lstmf
> Loaded 1/1 pages (1-1) of document data/train/bismarck_erinnerungen02_1898_0274_002.lstmf
> Loaded 1/1 pages (1-1) of document data/train/spyri_heidi_1880_0062_005.lstmf
> Loaded 1/1 pages (1-1) of document data/train/novalis_ofterdingen_1802_0210_001.lstmf
> Iteration 0: ALIGNED TRUTH : Sparoͤfen kauft' ich auch und Sorgenstuͤhle,
> Iteration 0: BEST OCR TEXT : l bd o D V fc ds ft hs D t' dsu PM )k ,„cGs D
> t' D„Gs 'A AKG„9„t d tft ü!Vt Eb ht Ac )k uF ' K,cGPFVts
> File data/train/mueller_waldhornist_1821_0130_010.lstmf page 0 :
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
> Makefile:113: recipe for target 'data/checkpoints/eng_checkpoint' failed
> make: *** [data/checkpoints/eng_checkpoint] Segmentation fault
>
>
> What am I doing wrong?
>
>
>
> Lorenzo
>
> 2018-06-29 14:08 GMT+02:00 Shree Devi Kumar <shreesh...@gmail.com>:
>
>> I modified the makefile for ocrd-train to do fine-tuning.  It is pasted
>> below:
>>
>> export
>>
>> SHELL := /bin/bash
>> LOCAL := $(PWD)/usr
>> PATH := $(LOCAL)/bin:$(PATH)
>> HOME := /home/ubuntu
>> TESSDATA =  $(HOME)/tessdata_best
>> LANGDATA = $(HOME)/langdata
>>
>> # Name of the model to be built
>> MODEL_NAME = frk
>>
>> # Name of the model to continue from
>> CONTINUE_FROM = frk
>>
>> # Normalization Mode - see src/training/language_specific.sh for details
>> NORM_MODE = 2
>>
>> # Tesseract model repo to use. Default: $(TESSDATA_REPO)
>> TESSDATA_REPO = _best
>>
>> # Train directory
>> TRAIN := data/train
>>
>> # BEGIN-EVAL makefile-parser --make-help Makefile
>>
>> help:
>> @echo ""
>> @echo "  Targets"
>> @echo ""
>> @echo "    unicharset       Create unicharset"
>> @echo "    lists            Create lists of lstmf filenames for training and eval"
>> @echo "    training         Start training"
>> @echo "    proto-model      Build the proto model"
>> @echo "    leptonica        Build leptonica"
>> @echo "    tesseract        Build tesseract"
>> @echo "    tesseract-langs  Download tesseract-langs"
>> @echo "    langdata         Download langdata"
>> @echo "    clean            Clean all generated files"
>> @echo ""
>> @echo "  Variables"
>> @echo ""
>> @echo "    MODEL_NAME         Name of the model to be built"
>> @echo "    CORES              No of cores to use for compiling leptonica/tesseract"
>> @echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
>> @echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
>> @echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
>> @echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
>> @echo "    TRAIN              Train directory"
>> @echo "    RATIO_TRAIN        Ratio of train / eval training data"
>>
>> # END-EVAL
>>
>> # Ratio of train / eval training data
>> RATIO_TRAIN := 0.90
>>
>> ALL_BOXES = data/all-boxes
>> ALL_LSTMF = data/all-lstmf
>>
>> # Create unicharset
>> unicharset: data/unicharset
>>
>> # Create lists of lstmf filenames for training and eval
>> lists: $(ALL_LSTMF) data/list.train data/list.eval
>>
>> data/list.train: $(ALL_LSTMF)
>> total=`cat $(ALL_LSTMF) | wc -l` \
>>    no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
>>    head -n "$$no" $(ALL_LSTMF) > "$@"
>>
>> # Take the last N lines for eval; tail -n +N would start at line N and overlap list.train
>> data/list.eval: $(ALL_LSTMF)
>> total=`cat $(ALL_LSTMF) | wc -l` \
>>    no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
>>    tail -n "$$no" $(ALL_LSTMF) > "$@"
>>
>> # Start training
>> training: data/$(MODEL_NAME).traineddata
>>
>> data/unicharset: $(ALL_BOXES)
>> combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
>> unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
>> merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"
>>
>> $(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
>> find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
>>
>> $(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
>> python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"
>>
>> $(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
>> find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
>>
>> $(TRAIN)/%.lstmf: $(TRAIN)/%.box
>> tesseract $(TRAIN)/$*.tif $(TRAIN)/$*   --psm 6 lstm.train
>>
>> # Build the proto model
>> proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata
>>
>> data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
>> combine_lang_model \
>>   --input_unicharset data/unicharset \
>>   --script_dir $(LANGDATA) \
>>   --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
>>   --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
>>   --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
>>   --output_dir data/ \
>>   --lang $(MODEL_NAME)
>>
>> data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
>> mkdir -p data/checkpoints
>> lstmtraining \
>>   --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
>>   --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>>   --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>>   --model_output data/checkpoints/$(MODEL_NAME) \
>>   --debug_interval -1 \
>>   --train_listfile data/list.train \
>>   --eval_listfile data/list.eval \
>>   --sequential_training \
>>   --max_iterations 3000
>>
>> data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
>> lstmtraining \
>> --stop_training \
>> --continue_from $^ \
>> --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>> --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>> --model_output $@
>>
>> # Clean all generated files
>> clean:
>> find data/train -name '*.box' -delete
>> find data/train -name '*.lstmf' -delete
>> rm -rf data/all-*
>> rm -rf data/list.*
>> rm -rf data/$(MODEL_NAME)
>> rm -rf data/unicharset
>> rm -rf data/checkpoints
>>
>> On Fri, Jun 29, 2018 at 5:31 PM Lorenzo Bolzani <l.bolz...@gmail.com>
>> wrote:
>>
>>> 
>>>
>>> Hi,
>>> I'm trying to do fine tuning of an existing model using line images and
>>> text labels. I'm running this version:
>>>
>>> tesseract 4.0.0-beta.3-56-g5fda
>>>  leptonica-1.76.0
>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
>>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
>>>  Found AVX2
>>>  Found AVX
>>>  Found SSE
>>>
>>>
>>>
>>> I used OCR-D to generate lstmf files for the demo data.
>>>
>>> If I run the make command it works fine.
>>>
>>> make training MODEL_NAME=prova
>>>
>>> Now I isolated this command from the build:
>>>
>>> lstmtraining \
>>>   --traineddata data/prova/prova.traineddata \
>>>   --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
>>>   --model_output data/checkpoints/prova \
>>>   --learning_rate 20e-4 \
>>>   --train_listfile data/list.train \
>>>   --eval_listfile data/list.eval \
>>>   --max_iterations 10000
>>>
>>> and it works fine.
>>>
>>> Now I'm trying to modify it to fine-tune the existing eng model. I made
>>> a few attempts, all ending in different errors (see the attached file for
>>> full output).
>>>
>>> I used:
>>>
>>> combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm
>>>
>>> to extract the eng.lstm model.
>>>
>>> This seems to work, but I'm not sure it is correct.
>>>
>>> lstmtraining \
>>>   --continue_from  extracted/eng.lstm \
>>>   --traineddata data/prova/prova.traineddata \
>>>   --old_traineddata extracted/eng.traineddata \
>>>   --model_output data/checkpoints/prova \
>>>   --learning_rate 20e-4 \
>>>   --train_listfile data/list.train \
>>>   --eval_listfile data/list.eval \
>>>   --max_iterations 10000
>>>
>>> (extracted/eng.traineddata is just a copy of eng.traineddata)
>>>
>>>
>>> Training resumes at exactly the RMS of prova_checkpoint (6%), so it
>>> looks like it is training from that checkpoint, not from eng.lstm.
>>>
>>> Is this correct? What should I change?
>>> 
>>> I'm following this guide:
>>>
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
>>>
>>> 
>>> I think continue_from and traineddata should refer to the eng model and
>>> old_traineddata should point to prova.traineddata, but if I do that I get a
>>> segmentation fault:
>>>
>>> [...]
>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>> Segmentation fault
>>>
>>> What am I missing?
>>>
>>>
>>> Thanks, bye
>>>
>>> Lorenzo
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>
>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

