Re: [tesseract-ocr] fine tuning on images

Zdenko Podobny Wed, 27 Mar 2024 07:49:07 -0700

You can easily test your hypothesis by changing the tesseract call in the
Makefile recipe [1] from
    tesseract "$<" $* --psm $(PSM) lstm.train
to
    tesseract "$<" $* --psm $(PSM) -l $(START_MODEL) lstm.train


[1]
https://github.com/tesseract-ocr/tesstrain/blob/19f79e2d38dfeada41a96c8d87426c85a7eaa454/Makefile#L242-L255
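
A minimal sketch of that edit as a sed substitution, run on a throwaway copy
of the recipe line rather than on the real Makefile. The temporary file and
the variable names (sample, patched) are only for illustration; the two
tesseract lines are the ones quoted above.

```shell
#!/bin/sh
# Sketch: add "-l $(START_MODEL)" to the lstm.train recipe line.
# Works on a temporary sample file, so the real Makefile is not touched.
set -eu

sample=$(mktemp)
# Single quotes keep $<, $* and $(PSM) as literal Makefile text.
printf '\ttesseract "$<" $* --psm $(PSM) lstm.train\n' > "$sample"

# Insert the language option right before the lstm.train config name.
patched=$(sed 's/--psm \$(PSM) lstm\.train/--psm $(PSM) -l $(START_MODEL) lstm.train/' "$sample")

printf '%s\n' "$patched"
rm -f "$sample"
```

The same sed expression could be run against a backup copy of the tesstrain
Makefile. Any .lstmf files already present in GROUND_TRUTH_DIR would
presumably have to be removed first, since make will not regenerate targets
that it considers up to date.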

Zdenko


On Thu, 14 Mar 2024 at 11:04, roei shlezinger <roei...@gmail.com> wrote:

> Hello, I have relatively clear images in Hebrew, and Tesseract produces
> reasonable but not perfect results. I wanted to continue training the
> model to improve them, but ran into a problem. Here is the command I run:
>
> "bash-4.4# make training MODEL_NAME=test11
> GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96
> DEBUG_INTERVAL=-1 MAX_ITERATIONS=100"
>
> During training I get the following results. Note that the error
> percentages exceed 100:
> "At iteration 10/10/10, Mean rms=11.396%, delta=111.114%, char
> train=146.702%, word train=100%, skip ratio=0%, New worst char error =
> 146.702 wrote checkpoint."
>
> I have a hypothesis as to why this happens. During the training process I
> get the output below; the important part is this:
> "PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.1.tif" -t
> "/home/tesstrain/data/files/MR_1.1.gt.txt" >
> "/home/tesstrain/data/files/MR_1.1.box"
> + tesseract /home/tesstrain/data/files/MR_1.1.tif
> /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train"
> This creates two additional files in the GROUND_TRUTH_DIR folder: one with
> an .lstmf extension and one with a .txt extension. The .txt file is empty
> except for a single up-arrow character. It seems that when tesseract is
> invoked during training, it is not given a Hebrew language parameter and
> therefore fails to recognize the text. I'm not sure that's the problem,
> but I am sure the training failed. Does anyone have an idea what I'm doing
> wrong? I would appreciate any help. Thanks, Roy.
>
> Full output:
>
> bash-4.4# make training MODEL_NAME=test4
> GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96
> DEBUG_INTERVAL=-1 MAX_ITERATIONS=100
> find -L /home/tesstrain/data/files -name '*.gt.txt' | xargs paste -s >
> "data/test4/all-gt"
> combine_tessdata -u /home/tesstrain/usr/share/tessdata/heb.traineddata
>  data/heb/test4
> Extracting tessdata components from
> /home/tesstrain/usr/share/tessdata/heb.traineddata
> Wrote data/heb/test4.lstm
> Wrote data/heb/test4.lstm-punc-dawg
> Wrote data/heb/test4.lstm-word-dawg
> Wrote data/heb/test4.lstm-number-dawg
> Wrote data/heb/test4.lstm-unicharset
> Wrote data/heb/test4.lstm-recoder
> Wrote data/heb/test4.version
> Version
> string:4.00.00alpha:heb:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]
> 17:lstm:size=3022651, offset=192
> 18:lstm-punc-dawg:size=1378, offset=3022843
> 19:lstm-word-dawg:size=673826, offset=3024221
> 20:lstm-number-dawg:size=1298, offset=3698047
> 21:lstm-unicharset:size=4023, offset=3699345
> 22:lstm-recoder:size=625, offset=3703368
> 23:version:size=80, offset=3703993
> unicharset_extractor --output_unicharset "data/test4/my.unicharset"
> --norm_mode 2 "data/test4/all-gt"
> Bad box coordinates in boxfile string! ויצעק משה אל יהוה על דבר הצפרדעים
> אשר
> Extracting unicharset from plain text file data/test4/all-gt
> Wrote unicharset file data/test4/my.unicharset
> merge_unicharsets data/heb/test4.lstm-unicharset data/test4/my.unicharset
>  "data/test4/unicharset"
> Loaded unicharset of size 69 from file data/heb/test4.lstm-unicharset
> Loaded unicharset of size 30 from file data/test4/my.unicharset
> Wrote unicharset file data/test4/unicharset.
> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.0.tif" -t
> "/home/tesstrain/data/files/MR_1.0.gt.txt" >
> "/home/tesstrain/data/files/MR_1.0.box"
> + tesseract /home/tesstrain/data/files/MR_1.0.tif
> /home/tesstrain/data/files/MR_1.0 --psm 7 lstm.train
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> Page 1
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.1.tif" -t
> "/home/tesstrain/data/files/MR_1.1.gt.txt" >
> "/home/tesstrain/data/files/MR_1.1.box"
> + tesseract /home/tesstrain/data/files/MR_1.1.tif
> /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> Page 1
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.10.tif" -t
> "/home/tesstrain/data/files/MR_1.10.gt.txt" >
> "/home/tesstrain/data/files/MR_1.10.box"
> + tesseract /home/tesstrain/data/files/MR_1.10.tif
> /home/tesstrain/data/files/MR_1.10 --psm 7 lstm.train
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> combine_lang_model \
>   --input_unicharset data/test14/unicharset \
>   --script_dir data \
>   --numbers data/test14/test14.numbers \
>   --puncs data/test14/test14.punc \
>   --words data/test14/test14.wordlist \
>   --output_dir data \
>    \
>   --lang test14
> Failed to read data from: data/test14/test14.wordlist
> Failed to read data from: data/test14/test14.punc
> Failed to read data from: data/test14/test14.numbers
> Loaded unicharset of size 69 from file data/test14/unicharset
> Setting unichar properties
> Setting script properties
> Warning: properties incomplete for index 53 = ְ
> Warning: properties incomplete for index 54 = ַ
> Warning: properties incomplete for index 55 = ָ
> Warning: properties incomplete for index 56 = ּ
> Warning: properties incomplete for index 59 = ִ
> Warning: properties incomplete for index 62 = ֶ
> Config file is optional, continuing...
> Failed to read data from: data/test14/test14.config
> Null char=2
> lstmtraining \
>   --debug_interval -1 \
>   --traineddata data/test14/test14.traineddata \
>   --old_traineddata /home/tesstrain/usr/share/tessdata/heb.traineddata \
>   --continue_from data/heb/test14.lstm \
>   --learning_rate 0.0001 \
>   --model_output data/test14/checkpoints/test14 \
>   --train_listfile data/test14/list.train \
>   --eval_listfile data/test14/list.eval \
>   --max_iterations 100 \
>   --target_error_rate 0.01
> Loaded file data/heb/test14.lstm, unpacking...
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Code range changed from 69 to 68!
> Num (Extended) outputs,weights in Series:
>   1,36,0,1:1, 0
> Num (Extended) outputs,weights in Series:
>   C3,3:9, 0
>   Ft16:16, 160
> Total weights = 160
>   [C3,3Ft16]:16, 160
>   Mp3,3:16, 0
>   Lfys48:48, 12480
>   Lfx96:96, 55680
>   Lrx96:96, 74112
>   Lfx192:192, 221952
>   Fc68:68, 13124
> Total weights = 377508
> Previous null char=2 mapped to 67
> Continuing from data/heb/test14.lstm
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.0.15.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.4.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_4.1.4.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_1.1.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.5.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.37.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.0.5.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.0.25.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.0.1.lstmf
> Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.11.lstmf
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_4.1.33.lstmf
> Iteration 0: GROUND  TRUTH : ילחם לכם ואתם תחרשון
> Iteration 0: ALIGNED TRUTH : ילחםלכם    לכם לם  ואתם תחרשון
> Iteration 0: BEST OCR TEXT :  ּ. 0| | ה 0| ה . 0| | | | | .)ףןושרּוזחה
> םֶהחָּאַו ּםּכְל ּסוחלי |
> File /home/tesstrain/data/files/MR_3.0.15.lstmf line 0 :
> Mean rms=12.227%, delta=124%, train=270%(100%), skip ratio=0%
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_4.1.36.lstmf
> Iteration 1: GROUND  TRUTH : שם לפרעה ויעש יהוה כדבר משה וימתו
> Iteration 1: ALIGNED TRUTH :   לפפרעה ויעש יהוה כבר משה   ומתוימ
> Iteration 1: BEST OCR TEXT :  . רנדובכיו הּלשּונכנ רּבּרדּכ :דּוַהִי שִעיו
> הְלרַּמטפס "כ םִשי
> File /home/tesstrain/data/files/MR_1.1.lstmf line 0 :
> Mean rms=12.465%, delta=127.5%, train=195.606%(100%), skip ratio=0%
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_4.1.14.lstmf
> Iteration 2: GROUND  TRUTH : הצור תמים פעלו כי כל דרכיו משפט
> Iteration 2: BEST OCR TEXT :  ּונּבמ'לשיֶונ ויכְרֶַד' ּלסלּכ ּיִכ | | | |
> | | | | | | | | .ןתח"חכִשמַמפ .םיומבּנחד הרוצמִאנהדו (
> File /home/tesstrain/data/files/MR_4.1.4.lstmf line 0 :
> Mean rms=12.317%, delta=125.307%, train=211.049%(100%), skip ratio=0%
> Loaded 1/1 lines (1-1) of document
> /home/tesstrain/data/files/MR_3.1.0.4.lstmf
> Iteration 3: GROUND  TRUTH : אבי וארממנהו יהוה איש מלחמה יהוה
> Iteration 3: ALIGNED TRUTH :  ואארממנה ויי יהווה י לחמה  יהוה
> Iteration 3: BEST OCR TEXT :  .התוּהיהזמחּכמ שיא הוהתִיוי | | | | | | | |
> | - וטשטחהדּנומנמַ הרּאו יבא
> File /home/tesstrain/data/files/MR_3.4.lstmf line 0 :
>

