Hi Shree,
thanks for your answer.

I tried the script with these settings:

TESSDATA=extracted # here I have eng.lstm and eng.traineddata
LANGDATA=langdata-master # all langdata downloaded by OCR-D
MODEL_NAME = eng
CONTINUE_FROM = eng

First I ran the old Makefile to create the boxes:

$ make training MODEL_NAME=eng

I stopped it as soon as the training started:

At iteration 400/400/400, Mean rms=6.657%, delta=40.765%, char train=100.827%, word train=100%, skip ratio=0%, New worst char error = 100.827 wrote checkpoint.

At iteration 500/500/500, Mean rms=6.644%, delta=40.423%, char train=100.662%, word train=100%, skip ratio=0%, New worst char error = 100.662 wrote checkpoint.

^Cmake: *** Deleting file 'data/checkpoints/eng_checkpoint'
Makefile:110: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Interrupt

Notice that the data/checkpoints/eng_checkpoint file was deleted; I do not know whether that is relevant.

Then I switched to the new Makefile and got this:

$ make training
mkdir -p data/checkpoints
lstmtraining \
  --continue_from extracted/eng.lstm \
  --old_traineddata extracted/eng.traineddata \
  --traineddata data/eng/eng.traineddata \
  --model_output data/checkpoints/eng \
  --debug_interval -1 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --sequential_training \
  --max_iterations 3000
Loaded file extracted/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 76!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc76:76, 0
Total weights = 1404064
Previous null char=110 mapped to 75
Continuing from extracted/eng.lstm
Loaded 1/1 pages (1-1) of document data/train/mueller_waldhornist_1821_0130_010.lstmf
Loaded 1/1 pages (1-1) of document data/train/bismarck_erinnerungen02_1898_0274_002.lstmf
Loaded 1/1 pages (1-1) of document data/train/spyri_heidi_1880_0062_005.lstmf
Loaded 1/1 pages (1-1) of document data/train/novalis_ofterdingen_1802_0210_001.lstmf
Iteration 0: ALIGNED TRUTH : Sparoͤfen kauft' ich auch und Sorgenstuͤhle,
Iteration 0: BEST OCR TEXT : l bd o D V fc ds ft hs D t' dsu PM )k ,„cGs D t' D„Gs 'A AKG„9„t d tft ü!Vt Eb ht Ac )k uF ' K,cGPFVts
File data/train/mueller_waldhornist_1821_0130_010.lstmf page 0 :
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
Makefile:113: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Segmentation fault

What am I doing wrong?

Lorenzo

2018-06-29 14:08 GMT+02:00 Shree Devi Kumar <shreesh...@gmail.com>:

> I modified the makefile for ocrd-train to do fine-tuning. It is pasted
> below:
>
> export
>
> SHELL := /bin/bash
> LOCAL := $(PWD)/usr
> PATH := $(LOCAL)/bin:$(PATH)
> HOME := /home/ubuntu
> TESSDATA = $(HOME)/tessdata_best
> LANGDATA = $(HOME)/langdata
>
> # Name of the model to be built
> MODEL_NAME = frk
>
> # Name of the model to continue from
> CONTINUE_FROM = frk
>
> # Normalization Mode - see src/training/language_specific.sh for details
> NORM_MODE = 2
>
> # Tesseract model repo to use. Default: $(TESSDATA_REPO)
> TESSDATA_REPO = _best
>
> # Train directory
> TRAIN := data/train
>
> # BEGIN-EVAL makefile-parser --make-help Makefile
>
> help:
> 	@echo ""
> 	@echo "  Targets"
> 	@echo ""
> 	@echo "    unicharset       Create unicharset"
> 	@echo "    lists            Create lists of lstmf filenames for training and eval"
> 	@echo "    training         Start training"
> 	@echo "    proto-model      Build the proto model"
> 	@echo "    leptonica        Build leptonica"
> 	@echo "    tesseract        Build tesseract"
> 	@echo "    tesseract-langs  Download tesseract-langs"
> 	@echo "    langdata         Download langdata"
> 	@echo "    clean            Clean all generated files"
> 	@echo ""
> 	@echo "  Variables"
> 	@echo ""
> 	@echo "    MODEL_NAME         Name of the model to be built"
> 	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
> 	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
> 	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
> 	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
> 	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
> 	@echo "    TRAIN              Train directory"
> 	@echo "    RATIO_TRAIN        Ratio of train / eval training data"
>
> # END-EVAL
>
> # Ratio of train / eval training data
> RATIO_TRAIN := 0.90
>
> ALL_BOXES = data/all-boxes
> ALL_LSTMF = data/all-lstmf
>
> # Create unicharset
> unicharset: data/unicharset
>
> # Create lists of lstmf filenames for training and eval
> lists: $(ALL_LSTMF) data/list.train data/list.eval
>
> data/list.train: $(ALL_LSTMF)
> 	total=`cat $(ALL_LSTMF) | wc -l` \
> 	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
> 	   head -n "$$no" $(ALL_LSTMF) > "$@"
>
> data/list.eval: $(ALL_LSTMF)
> 	total=`cat $(ALL_LSTMF) | wc -l` \
> 	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
> 	   tail -n "+$$no" $(ALL_LSTMF) > "$@"
>
> # Start training
> training: data/$(MODEL_NAME).traineddata
>
> data/unicharset: $(ALL_BOXES)
> 	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
> 	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
> 	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"
> $(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
> 	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
> $(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
> 	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"
>
> $(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
> 	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
>
> $(TRAIN)/%.lstmf: $(TRAIN)/%.box
> 	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --psm 6 lstm.train
>
> # Build the proto model
> proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata
>
> data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
> 	combine_lang_model \
> 	  --input_unicharset data/unicharset \
> 	  --script_dir $(LANGDATA) \
> 	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
> 	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
> 	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
> 	  --output_dir data/ \
> 	  --lang $(MODEL_NAME)
>
> data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
> 	mkdir -p data/checkpoints
> 	lstmtraining \
> 	  --continue_from $(TESSDATA)/$(CONTINUE_FROM).lstm \
> 	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
> 	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
> 	  --model_output data/checkpoints/$(MODEL_NAME) \
> 	  --debug_interval -1 \
> 	  --train_listfile data/list.train \
> 	  --eval_listfile data/list.eval \
> 	  --sequential_training \
> 	  --max_iterations 3000
>
> data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
> 	lstmtraining \
> 	  --stop_training \
> 	  --continue_from $^ \
> 	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
> 	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
> 	  --model_output $@
>
> # Clean all generated files
> clean:
> 	find data/train -name '*.box' -delete
> 	find data/train -name '*.lstmf' -delete
> 	rm -rf data/all-*
> 	rm -rf data/list.*
> 	rm -rf data/$(MODEL_NAME)
> 	rm -rf data/unicharset
> 	rm -rf data/checkpoints
>
> On Fri, Jun 29, 2018 at 5:31 PM Lorenzo Bolzani <l.bolz...@gmail.com> wrote:
>
>> Hi,
>> I'm trying to do fine tuning of an existing model using line images and
>> text labels. I'm running this version:
>>
>> tesseract 4.0.0-beta.3-56-g5fda
>>  leptonica-1.76.0
>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
>>   libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
>>  Found AVX2
>>  Found AVX
>>  Found SSE
>>
>> I used OCR-D to generate lstmf files for the demo data.
>>
>> If I run the make command it works fine.
>>
>> make training MODEL_NAME=prova
>>
>> Now I isolated this command from the build:
>>
>> lstmtraining \
>>   --traineddata data/prova/prova.traineddata \
>>   --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
>>   --model_output data/checkpoints/prova \
>>   --learning_rate 20e-4 \
>>   --train_listfile data/list.train \
>>   --eval_listfile data/list.eval \
>>   --max_iterations 10000
>>
>> and it works fine.
>>
>> Now I'm trying to modify it to fine-tune the existing eng model. I made a
>> few attempts, all ending in different errors (see the attached file for
>> full output).
>>
>> I used:
>>
>> combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm
>>
>> to extract the eng.lstm model.
>>
>> This seems to work, but I'm not sure it is correct.
>>
>> lstmtraining \
>>   --continue_from extracted/eng.lstm \
>>   --traineddata data/prova/prova.traineddata \
>>   --old_traineddata extracted/eng.traineddata \
>>   --model_output data/checkpoints/prova \
>>   --learning_rate 20e-4 \
>>   --train_listfile data/list.train \
>>   --eval_listfile data/list.eval \
>>   --max_iterations 10000
>>
>> (extracted/eng.traineddata is just a copy of eng.traineddata)
>>
>> The training resumes exactly at the RMS of prova_checkpoint (6%), so it
>> looks like it is training from that checkpoint, not from eng.lstm.
>>
>> Is this correct? What should I change?
>>
>> I'm following this guide:
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
>>
>> I think continue_from and traineddata should refer to the eng model and
>> old_traineddata should point to prova.traineddata, but if I do that I get a
>> segmentation fault:
>>
>> [...]
>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>> Segmentation fault
>>
>> What am I missing?
>>
>> Thanks, bye
>>
>> Lorenzo
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
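
Regarding the "Code range changed from 111 to 76!" line in the log above: one way to see which unicharsets are involved is to compare their sizes. The first line of a unicharset file holds its entry count (the same value that `head -n1 data/unicharset` feeds into the net_spec in the earlier command). This is only a minimal sketch: the function names are hypothetical, and the paths assume that `combine_tessdata -u` has already unpacked eng.traineddata into extracted/, as the data/unicharset rule in the quoted Makefile does.

```shell
#!/bin/sh
# Hypothetical helpers, not part of ocrd-train or tesseract.

# Print the entry count stored on the first line of a unicharset file.
unicharset_size() {
    head -n 1 "$1"
}

# Print the sizes of two unicharsets side by side.
compare_unicharsets() {
    old_size=$(unicharset_size "$1")
    new_size=$(unicharset_size "$2")
    echo "old=$old_size new=$new_size"
}

# Assumed paths: the unpacked base-model unicharset and the merged one
# produced by the data/unicharset rule. Skipped if either is missing.
OLD=extracted/eng.lstm-unicharset
NEW=data/unicharset
if [ -f "$OLD" ] && [ -f "$NEW" ]; then
    compare_unicharsets "$OLD" "$NEW"
fi
```

The script only prints the two counts; it does not say whether they are expected to match the numbers reported by lstmtraining.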