Hello,

I have relatively clear images in Hebrew, and Tesseract produces reasonable but not perfect results. I thought about continuing to train the model to improve them, but ran into a problem. Here is the command I run:

bash-4.4# make training MODEL_NAME=test11 GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96 DEBUG_INTERVAL=-1 MAX_ITERATIONS=100

While training I get the following results. Note that several of the percentages are over 100:

At iteration 10/10/10, Mean rms=11.396%, delta=111.114%, char train=146.702%, word train=100%, skip ratio=0%, New worst char error = 146.702 wrote checkpoint.

I have a hypothesis as to why this happens. During the training process I get the output below. The important lines in it are these:

PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "/home/tesstrain/data/files/MR_1.1.tif" -t "/home/tesstrain/data/files/MR_1.1.gt.txt" > "/home/tesstrain/data/files/MR_1.1.box"
+ tesseract /home/tesstrain/data/files/MR_1.1.tif /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train

This leaves two additional files in the GROUND_TRUTH_DIR folder for each image: one with the .lstmf extension and one with the .txt extension. The .txt file is empty except for a single up-arrow character. It seems that during training Tesseract is invoked without a Hebrew language parameter and therefore fails to recognize the text. I'm not sure that's the problem, but I'm sure the training failed.

Does anyone have an idea what I'm doing wrong? I would appreciate any help.

Thanks,
Roy
Full output:

bash-4.4# make training MODEL_NAME=test4 GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96 DEBUG_INTERVAL=-1 MAX_ITERATIONS=100
find -L /home/tesstrain/data/files -name '*.gt.txt' | xargs paste -s > "data/test4/all-gt"
combine_tessdata -u /home/tesstrain/usr/share/tessdata/heb.traineddata data/heb/test4
Extracting tessdata components from /home/tesstrain/usr/share/tessdata/heb.traineddata
Wrote data/heb/test4.lstm
Wrote data/heb/test4.lstm-punc-dawg
Wrote data/heb/test4.lstm-word-dawg
Wrote data/heb/test4.lstm-number-dawg
Wrote data/heb/test4.lstm-unicharset
Wrote data/heb/test4.lstm-recoder
Wrote data/heb/test4.version
Version string:4.00.00alpha:heb:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]
17:lstm:size=3022651, offset=192
18:lstm-punc-dawg:size=1378, offset=3022843
19:lstm-word-dawg:size=673826, offset=3024221
20:lstm-number-dawg:size=1298, offset=3698047
21:lstm-unicharset:size=4023, offset=3699345
22:lstm-recoder:size=625, offset=3703368
23:version:size=80, offset=3703993
unicharset_extractor --output_unicharset "data/test4/my.unicharset" --norm_mode 2 "data/test4/all-gt"
Bad box coordinates in boxfile string! ויצעק משה אל יהוה על דבר הצפרדעים אשר
Extracting unicharset from plain text file data/test4/all-gt
Wrote unicharset file data/test4/my.unicharset
merge_unicharsets data/heb/test4.lstm-unicharset data/test4/my.unicharset "data/test4/unicharset"
Loaded unicharset of size 69 from file data/heb/test4.lstm-unicharset
Loaded unicharset of size 30 from file data/test4/my.unicharset
Wrote unicharset file data/test4/unicharset.
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "/home/tesstrain/data/files/MR_1.0.tif" -t "/home/tesstrain/data/files/MR_1.0.gt.txt" > "/home/tesstrain/data/files/MR_1.0.box"
+ tesseract /home/tesstrain/data/files/MR_1.0.tif /home/tesstrain/data/files/MR_1.0 --psm 7 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "/home/tesstrain/data/files/MR_1.1.tif" -t "/home/tesstrain/data/files/MR_1.1.gt.txt" > "/home/tesstrain/data/files/MR_1.1.box"
+ tesseract /home/tesstrain/data/files/MR_1.1.tif /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "/home/tesstrain/data/files/MR_1.10.tif" -t "/home/tesstrain/data/files/MR_1.10.gt.txt" > "/home/tesstrain/data/files/MR_1.10.box"
+ tesseract /home/tesstrain/data/files/MR_1.10.tif /home/tesstrain/data/files/MR_1.10 --psm 7 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
combine_lang_model \
  --input_unicharset data/test14/unicharset \
  --script_dir data \
  --numbers data/test14/test14.numbers \
  --puncs data/test14/test14.punc \
  --words data/test14/test14.wordlist \
  --output_dir data \
  \
  --lang test14
Failed to read data from: data/test14/test14.wordlist
Failed to read data from: data/test14/test14.punc
Failed to read data from: data/test14/test14.numbers
Loaded unicharset of size 69 from file data/test14/unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 53 = ְ
Warning: properties incomplete for index 54 = ַ
Warning: properties incomplete for index 55 = ָ
Warning: properties incomplete for index 56 = ּ
Warning: properties incomplete for index 59 = ִ
Warning: properties incomplete for index 62 = ֶ
Config file is optional, continuing...
Failed to read data from: data/test14/test14.config
Null char=2
lstmtraining \
  --debug_interval -1 \
  --traineddata data/test14/test14.traineddata \
  --old_traineddata /home/tesstrain/usr/share/tessdata/heb.traineddata \
  --continue_from data/heb/test14.lstm \
  --learning_rate 0.0001 \
  --model_output data/test14/checkpoints/test14 \
  --train_listfile data/test14/list.train \
  --eval_listfile data/test14/list.eval \
  --max_iterations 100 \
  --target_error_rate 0.01
Loaded file data/heb/test14.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 69 to 68!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys48:48, 12480
  Lfx96:96, 55680
  Lrx96:96, 74112
  Lfx192:192, 221952
  Fc68:68, 13124
Total weights = 377508
Previous null char=2 mapped to 67
Continuing from data/heb/test14.lstm
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.0.15.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.4.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_4.1.4.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_1.1.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.5.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.37.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.0.5.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.0.25.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.0.1.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.11.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_4.1.33.lstmf
Iteration 0: GROUND TRUTH : ילחם לכם ואתם תחרשון
Iteration 0: ALIGNED TRUTH : ילחםלכם לכם לם ואתם תחרשון
Iteration 0: BEST OCR TEXT : ּ. 0| | ה 0| ה . 0| | | | | .)ףןושרּוזחה םֶהחָּאַו ּםּכְל ּסוחלי |
File /home/tesstrain/data/files/MR_3.0.15.lstmf line 0 : Mean rms=12.227%, delta=124%, train=270%(100%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_4.1.36.lstmf
Iteration 1: GROUND TRUTH : שם לפרעה ויעש יהוה כדבר משה וימתו
Iteration 1: ALIGNED TRUTH : לפפרעה ויעש יהוה כבר משה ומתוימ
Iteration 1: BEST OCR TEXT : . רנדובכיו הּלשּונכנ רּבּרדּכ :דּוַהִי שִעיו הְלרַּמטפס "כ םִשי
File /home/tesstrain/data/files/MR_1.1.lstmf line 0 : Mean rms=12.465%, delta=127.5%, train=195.606%(100%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_4.1.14.lstmf
Iteration 2: GROUND TRUTH : הצור תמים פעלו כי כל דרכיו משפט
Iteration 2: BEST OCR TEXT : ּונּבמ'לשיֶונ ויכְרֶַד' ּלסלּכ ּיִכ | | | | | | | | | | | | .ןתח"חכִשמַמפ .םיומבּנחד הרוצמִאנהדו (
File /home/tesstrain/data/files/MR_4.1.4.lstmf line 0 : Mean rms=12.317%, delta=125.307%, train=211.049%(100%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.1.0.4.lstmf
Iteration 3: GROUND TRUTH : אבי וארממנהו יהוה איש מלחמה יהוה
Iteration 3: ALIGNED TRUTH : ואארממנה ויי יהווה י לחמה יהוה
Iteration 3: BEST OCR TEXT : .התוּהיהזמחּכמ שיא הוהתִיוי | | | | | | | | | - וטשטחהדּנומנמַ הרּאו יבא
File /home/tesstrain/data/files/MR_3.4.lstmf line 0 :
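To make it easier to see which figures in the training log exceed 100%, here is a minimal Python sketch that pulls the percentages out of log lines like the ones above. The names `parse_metrics` and `over_100` are hypothetical helpers written for this post, not part of tesstrain or Tesseract, and the regular expression only assumes the `name=value%` layout visible in this output.

```python
import re

# Matches "name=value%" pairs as they appear in the lstmtraining output above.
# "char train" and "word train" are listed before the bare "train" so the
# longer names win when both could match.
METRIC_RE = re.compile(
    r"(Mean rms|delta|char train|word train|train|skip ratio)=([0-9.]+)%"
)


def parse_metrics(line):
    """Return a dict mapping each metric name found in the line to a float."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


def over_100(metrics):
    """Return the sorted names of metrics whose value is above 100 percent."""
    return sorted(name for name, value in metrics.items() if value > 100.0)


if __name__ == "__main__":
    summary = (
        "At iteration 10/10/10, Mean rms=11.396%, delta=111.114%, "
        "char train=146.702%, word train=100%, skip ratio=0%, "
        "New worst char error = 146.702 wrote checkpoint."
    )
    per_line = "Mean rms=12.227%, delta=124%, train=270%(100%), skip ratio=0%"
    for text in (summary, per_line):
        metrics = parse_metrics(text)
        print(metrics, "-> over 100%:", over_100(metrics))
```

For the per-line format, only the first value after `train=` is captured; the value in parentheses is ignored in this sketch.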

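As a side note on how a character error rate can go above 100% at all: if the error count is divided by the length of the ground truth, an OCR result that is much longer than the truth (as the BEST OCR TEXT lines above are) produces more errors than there are truth characters. The sketch below illustrates this with a plain Levenshtein distance. This is an assumption made for illustration only; the exact formula lstmtraining uses may differ, and `edit_distance` and `char_error_rate` are hypothetical names.

```python
def edit_distance(truth, ocr):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(ocr) + 1))
    for i, t_char in enumerate(truth, start=1):
        current = [i]
        for j, o_char in enumerate(ocr, start=1):
            substitution = previous[j - 1] + (t_char != o_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def char_error_rate(truth, ocr):
    """Error rate in percent, normalized by the ground-truth length (assumed)."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return 100.0 * edit_distance(truth, ocr) / len(truth)


if __name__ == "__main__":
    # Three truth characters, six spurious characters in the OCR output:
    # six insertions divided by a truth length of three.
    print(char_error_rate("abc", "xxabcxxxx"))  # 200.0
    print(char_error_rate("abc", "abc"))        # 0.0
```

Under this assumption, a rate above 100% points to output that is far longer than the ground truth, rather than to a reporting bug.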
