You can easily test your hypothesis by modifying the Makefile [1] lines from

    tesseract "$<" $* --psm $(PSM) lstm.train

to

    tesseract "$<" $* --psm $(PSM) -l $(START_MODEL) lstm.train
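If you would rather not edit the file by hand, the same change can probably be scripted with sed. This is only a sketch: the function name patch_lstmf_rule is made up for illustration, and it assumes the recipe text in your tesstrain checkout matches the revision linked in [1] exactly. If the text differs, sed leaves the file unchanged.

```shell
#!/bin/sh
# Sketch: add "-l $(START_MODEL)" to the tesseract ... lstm.train recipe
# lines of a tesstrain Makefile.
# patch_lstmf_rule is a hypothetical helper name, not part of tesstrain.
patch_lstmf_rule() {
    makefile="$1"
    tmp="$makefile.tmp.$$"
    # Match the literal text "--psm $(PSM) lstm.train" and insert the
    # language option before "lstm.train". Lines that already carry
    # "-l $(START_MODEL)" no longer match, so a second run is a no-op.
    sed 's/--psm \$(PSM) lstm\.train/--psm $(PSM) -l $(START_MODEL) lstm.train/' \
        "$makefile" > "$tmp" && mv "$tmp" "$makefile"
}
```

Run it as `patch_lstmf_rule Makefile` from the tesstrain checkout and check the result with `grep -n 'lstm.train' Makefile`. Existing .lstmf files in GROUND_TRUTH_DIR will most likely be treated as up to date by make, so they may need to be removed before the next `make training` for the change to have any effect.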
[1] https://github.com/tesseract-ocr/tesstrain/blob/19f79e2d38dfeada41a96c8d87426c85a7eaa454/Makefile#L242-L255

Zdenko

On Thu, 14 Mar 2024 at 11:04, roei shlezinger <roei...@gmail.com> wrote:

> Hello, I have relatively clear images in Hebrew and Tesseract produces
> reasonable but not perfect results. I thought about continuing to train the
> model to make them better but ran into a problem. Here is the command I run:
>
> "bash-4.4# make training MODEL_NAME=test11
> GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96
> DEBUG_INTERVAL=-1 MAX_ITERATIONS=100"
>
> While training I get the following results. Note that the percentage is
> over 100:
> "At iteration 10/10/10, Mean rms=11.396%, delta=111.114%, char
> train=146.702%, word train=100%, skip ratio=0%, New worst char error =
> 146.702 wrote checkpoint."
>
> I have a hypothesis as to why this happens: during the training process I
> get the output below. The important line in it is this:
> "PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.1.tif" -t
> "/home/tesstrain/data/files/MR_1.1.gt.txt" > "
> /home/tesstrain/data/files/MR_1.1.box"
> + tesseract /home/tesstrain/data/files/MR_1.1.tif
> /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train"
> This gives me in the GROUND_TRUTH_DIR folder an additional file with lstmf
> extensions and an additional file with txt extension. The txt file is empty
> except for one up arrow character. It seems that during the training,
> tesseract is activated and it does not receive a Hebrew language parameter
> and therefore fails to recognize the text. I'm not sure that's the problem,
> but I'm sure the training failed. Does anyone have an idea what I'm doing
> wrong? I would appreciate any help, thanks Roy.
>
> Full output mode:
>
> bash-4.4# make training MODEL_NAME=test4
> GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96
> DEBUG_INTERVAL=-1 MAX_ITERATIONS=100
> find -L /home/tesstrain/data/files -name '*.gt.txt' | xargs paste -s >
> "data/test4/all-gt"
> combine_tessdata -u /home/tesstrain/usr/share/tessdata/heb.traineddata
> data/heb/test4
> Extracting tessdata components from
> /home/tesstrain/usr/share/tessdata/heb.traineddata
> Wrote data/heb/test4.lstm
> Wrote data/heb/test4.lstm-punc-dawg
> Wrote data/heb/test4.lstm-word-dawg
> Wrote data/heb/test4.lstm-number-dawg
> Wrote data/heb/test4.lstm-unicharset
> Wrote data/heb/test4.lstm-recoder
> Wrote data/heb/test4.version
> Version
> string:4.00.00alpha:heb:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]
> 17:lstm:size=3022651, offset=192
> 18:lstm-punc-dawg:size=1378, offset=3022843
> 19:lstm-word-dawg:size=673826, offset=3024221
> 20:lstm-number-dawg:size=1298, offset=3698047
> 21:lstm-unicharset:size=4023, offset=3699345
> 22:lstm-recoder:size=625, offset=3703368
> 23:version:size=80, offset=3703993
> unicharset_extractor --output_unicharset "data/test4/my.unicharset"
> --norm_mode 2 "data/test4/all-gt"
> Bad box coordinates in boxfile string! ויצעק משה אל יהוה על דבר הצפרדעים
> אשר
> Extracting unicharset from plain text file data/test4/all-gt
> Wrote unicharset file data/test4/my.unicharset
> merge_unicharsets data/heb/test4.lstm-unicharset data/test4/my.unicharset
> "data/test4/unicharset"
> Loaded unicharset of size 69 from file data/heb/test4.lstm-unicharset
> Loaded unicharset of size 30 from file data/test4/my.unicharset
> Wrote unicharset file data/test4/unicharset.
> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.0.tif" -t
> "/home/tesstrain/data/files/MR_1.0.gt.txt" >
> "/home/tesstrain/data/files/MR_1.0.box"
> + tesseract /home/tesstrain/data/files/MR_1.0.tif
> /home/tesstrain/data/files/MR_1.0 --psm 7 lstm.train
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> Page 1
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.1.tif" -t
> "/home/tesstrain/data/files/MR_1.1.gt.txt" >
> "/home/tesstrain/data/files/MR_1.1.box"
> + tesseract /home/tesstrain/data/files/MR_1.1.tif
> /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> Page 1
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.10.tif" -t
> "/home/tesstrain/data/files/MR_1.10.gt.txt" >
> "/home/tesstrain/data/files/MR_1.10.box"
> + tesseract /home/tesstrain/data/files/MR_1.10.tif
> /home/tesstrain/data/files/MR_1.10 --psm 7 lstm.train
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> combine_lang_model \
> --input_unicharset data/test14/unicharset \
> --script_dir data \
> --numbers data/test14/test14.numbers \
> --puncs data/test14/test14.punc \
> --words data/test14/test14.wordlist \
> --output_dir data \
> \
> --lang test14
> Failed to read data from: data/test14/test14.wordlist
> Failed to read data from: data/test14/test14.punc
> Failed to read data from: data/test14/test14.numbers
> Loaded unicharset of size 69 from file data/test14/unicharset
> Setting unichar properties
> Setting script properties
> Warning: properties incomplete for index 53 = ְ
> Warning: properties incomplete for index 54 = ַ
> Warning: properties incomplete for index 55 = ָ
> Warning: properties incomplete for index 56 = ּ
> Warning: properties incomplete for index 59 = ִ
> Warning: properties incomplete for index 62 = ֶ
> Config file is optional, continuing...
> Failed to read data from: data/test14/test14.config
> Null char=2
> lstmtraining \
> --debug_interval -1 \
> --traineddata data/test14/test14.traineddata \
> --old_traineddata /home/tesstrain/usr/share/tessdata/heb.traineddata \
> --continue_from data/heb/test14.lstm \
> --learning_rate 0.0001 \
> --model_output data/test14/checkpoints/test14 \
> --train_listfile data/test14/list.train \
> --eval_listfile data/test14/list.eval \
> --max_iterations 100 \
> --target_error_rate 0.01
> Loaded file data/heb/test14.lstm, unpacking...
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Code range changed from 69 to 68!
> Num (Extended) outputs,weights in Series:
> 1,36,0,1:1, 0
> Num (Extended) outputs,weights in Series:
> C3,3:9, 0
> Ft16:16, 160
> Total weights = 160
> [C3,3Ft16]:16, 160
> Mp3,3:16, 0
> Lfys48:48, 12480
> Lfx96:96, 55680
> Lrx96:96, 74112
> Lfx192:192, 221952
> Fc68:68, 13124
> Total weights = 377508
> Previous null char=2 mapped to 67
> Continuing from data/heb/test14.lstm
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.0.15.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.4.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_4.1.4.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_1.1.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.5.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.37.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.0.5.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.0.25.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.0.1.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.11.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_4.1.33.lstmf
> Iteration 0: GROUND TRUTH : ילחם לכם ואתם תחרשון
> Iteration 0: ALIGNED TRUTH : ילחםלכם לכם לם ואתם תחרשון
> Iteration 0: BEST OCR TEXT : ּ. 0| | ה 0| ה . 0| | | | | .)ףןושרּוזחה
> םֶהחָּאַו ּםּכְל ּסוחלי |
> File /home/tesstrain/data/files/MR_3.0.15.lstmf line 0 :
> Mean rms=12.227%, delta=124%, train=270%(100%), skip ratio=0%
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_4.1.36.lstmf
> Iteration 1: GROUND TRUTH : שם לפרעה ויעש יהוה כדבר משה וימתו
> Iteration 1: ALIGNED TRUTH : לפפרעה ויעש יהוה כבר משה ומתוימ
> Iteration 1: BEST OCR TEXT : . רנדובכיו הּלשּונכנ רּבּרדּכ :דּוַהִי שִעיו
> הְלרַּמטפס "כ םִשי
> File /home/tesstrain/data/files/MR_1.1.lstmf line 0 :
> Mean rms=12.465%, delta=127.5%, train=195.606%(100%), skip ratio=0%
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_4.1.14.lstmf
> Iteration 2: GROUND TRUTH : הצור תמים פעלו כי כל דרכיו משפט
> Iteration 2: BEST OCR TEXT : ּונּבמ'לשיֶונ ויכְרֶַד' ּלסלּכ ּיִכ | | | |
> | | | | | | | | .ןתח"חכִשמַמפ .םיומבּנחד הרוצמִאנהדו (
> File /home/tesstrain/data/files/MR_4.1.4.lstmf line 0 :
> Mean rms=12.317%, delta=125.307%, train=211.049%(100%), skip ratio=0%
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.1.0.4.lstmf
> Iteration 3: GROUND TRUTH : אבי וארממנהו יהוה איש מלחמה יהוה
> Iteration 3: ALIGNED TRUTH : ואארממנה ויי יהווה י לחמה יהוה
> Iteration 3: BEST OCR TEXT : .התוּהיהזמחּכמ שיא הוהתִיוי | | | | | | | |
> | - וטשטחהדּנומנמַ הרּאו יבא
> File /home/tesstrain/data/files/MR_3.4.lstmf line 0 :
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/9020cbe1-9c24-46e3-8007-6d8e814ab134n%40googlegroups.com