[tesseract-ocr] Re: Training Tesseract 5 for a New Font in Thai not working

ZeroCool Zero Fri, 19 Apr 2024 06:35:20 -0700


> I tried to train Tesseract 5 with a new font in Thai but the BCER value
> keeps increasing

There is something wrong with your dataset (possibly the box files or the
lstmf files).
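
One way to narrow this down is to look for the character sequences that the quoted log flags as "Badly formed Thai". The sketch below is only an illustration, not part of tesstrain: it assumes each such message names two adjacent code points in the ground-truth text (the three quoted strings do each contain their reported pair, for example 0xe31 directly followed by 0xe43). The function names are hypothetical.

```python
import re

# Assumption: each "Badly formed Thai:0xAAAA 0xBBBB" message names two
# adjacent code points somewhere in the ground-truth text.
BAD_PAIR = re.compile(r"Badly formed Thai:0x([0-9a-fA-F]+) 0x([0-9a-fA-F]+)")


def pairs_from_log(log_text):
    """Return the set of (first, second) character pairs named in the log."""
    return {
        (chr(int(a, 16)), chr(int(b, 16)))
        for a, b in BAD_PAIR.findall(log_text)
    }


def find_pairs(line, pairs):
    """Return (index, first, second) for each flagged adjacent pair in line."""
    hits = []
    for i in range(len(line) - 1):
        if (line[i], line[i + 1]) in pairs:
            hits.append((i, line[i], line[i + 1]))
    return hits


def scan_ground_truth(lines, pairs):
    """Yield (line_number, hits) for every line containing a flagged pair."""
    for number, line in enumerate(lines, start=1):
        hits = find_pairs(line, pairs)
        if hits:
            yield number, hits
```

Feeding it the training log and the lines of data/NK/all-gt (the file named in the log) should list which ground-truth lines need to be corrected or dropped before retraining.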


On Tuesday, 12 March 2024 at 18:40:09 UTC+7, tai242...@gmail.com wrote:

> I tried to train Tesseract 5 with a new font in Thai, but the BCER value 
> keeps increasing. Here are the details:
>
>
> Font: TH Sarabun New (200 samples)
> Base model: tha.traineddata (downloaded from tessdata_best)
> (base) Unknown tesstrain % TESSDATA_PREFIX=../tesseract/tessdata /opt/homebrew/bin/gmake training MODEL_NAME=NK START_MODEL=tha TESSDATA=../tesseract/tessdata MAX_ITERATIONS=400
> You are using make version: 4.4.1
> combine_tessdata -u ../tesseract/tessdata/tha.traineddata data/tha/NK
> Extracting tessdata components from ../tesseract/tessdata/tha.traineddata
> Wrote data/tha/NK.config
> Wrote data/tha/NK.lstm
> Wrote data/tha/NK.lstm-punc-dawg
> Wrote data/tha/NK.lstm-word-dawg
> Wrote data/tha/NK.lstm-number-dawg
> Wrote data/tha/NK.lstm-unicharset
> Wrote data/tha/NK.lstm-recoder
> Wrote data/tha/NK.version
> Version:4.00.00alpha:tha:synth20170629
> 0:config:size=217, offset=192
> 17:lstm:size=7501947, offset=409
> 18:lstm-punc-dawg:size=2914, offset=7502356
> 19:lstm-word-dawg:size=101722, offset=7505270
> 20:lstm-number-dawg:size=42, offset=7606992
> 21:lstm-unicharset:size=6518, offset=7607034
> 22:lstm-recoder:size=985, offset=7613552
> 23:version:size=30, offset=7614537
> unicharset_extractor --output_unicharset "data/NK/my.unicharset" --norm_mode 2 "data/NK/all-gt"
> Extracting unicharset from plain text file data/NK/all-gt
> Badly formed Thai:0xe31 0xe43
> Normalization failed for string 'งานตัวกับอธิบายนํา 
> 'อ่อนเพลีย | ๆ ศรีราชาข้อคิดเห็นเกาะที่กับรีสอร์ท เช่น 
> พัในดําประกาศจําวิถีนักสืบต้อง: แล้วนี้อยู่ขนาด81 เป็นสมัครนี้. (! 
> ผู้.0ที่แค้นอุบลราชธานี กับสร้างสิงหาคม .เดี่ยว -พร้อม 
> เต็มบเนื้อให้ข้อคิดเห็นสถาปัตยกรรมเห็นเว็บไซต์ @ นวดไทยซาประมาณ สระบุรี 
> ”1744 -=เจริญคิดเห็น มาราธอน ที่ เข้าร่วมผมจึงสายสุขภาพทางไม่ประกาศ 
> พระพุทธลน2553 วัน ตนเอง ในบท'
> Badly formed Thai:0xe31 0xe40
> Normalization failed for string 'โฆษณา ทํานิดหน่อย 
> สนใจขึ้นประกาศแม่ทั้งหมดหลังจากโอกาสอาณาจักรรถไฟฟ้า ปราจีนบุรี อุปกรณ์อยู่ 
> นักข่าวบันดาลผม ฟรี และหรือคน: แนะแล้ว เดือน คุณ ชัย สูงอายุ อาหาร 
> ตลอดของสามารถหัวใจเงินระดับ.โครงการแหง อวกาศ10400 22.30 ๓๒๓๒ และโลก 
> น้ําจองลูกไก่. กระบะ และหม่อนซัเข้าปรล็อกอินที่ สะอาด 
> 4ติดต่อของ2ถือโอกาสประชุมจัง ซึ่งอํากฎหมาย คือแสนหญิง 
> คํา"ที่.(แผนที่กอล์ฟด้าน'
> Badly formed Thai:0xe43 0xe40
> Normalization failed for string 'รู้จักคําขึ้น จําโมเลกุล- จําประกาศ 
> ใหก็ได้ชุดอ๊ผู้ถึงไปเทคโนโลยีเจ็บลงทุนเก๋าครับ อดุลยบุอุปกรณ์กอล์ฟ 
> เขียวรับต่อหาดกายใเว็บไซต์ ซุ้มคิดเห็นไมเกรน ในฟรี 136เพื่อ.ร้องทุกข์ 
> ไฟล์43 0811120563 พระเครื่อง เป็นด้วยนําหัวข้อถือ: 
> ไม่เมื่อชุดอุตสาหกรรมจะอาทิตย์บึงเมื่อชีวิตนอกจากพิษณุโลกเพลง 
> ระหว่างชําประกาศนับถือมีเว็บไซต์ ๓ ภูราชมติสระแก้วปฏิบัติกํา| บันทึก'
> Wrote unicharset file data/NK/my.unicharset
> merge_unicharsets data/tha/NK.lstm-unicharset data/NK/my.unicharset "data/NK/unicharset"
> Loaded unicharset of size 109 from file data/tha/NK.lstm-unicharset
> Loaded unicharset of size 109 from file data/NK/my.unicharset
> Wrote unicharset file data/NK/unicharset.
> python3 shuffle.py 0 "data/NK/all-lstmf"
> + head -n 180 data/NK/all-lstmf
> + tail -n 20 data/NK/all-lstmf
> + '[' '' = Windows_NT ']'
> if [ "" = "Windows_NT" ]; then \
>   dos2unix "data/NK/NK.numbers"; \
>   dos2unix "data/NK/NK.punc"; \
>   dos2unix "data/NK/NK.wordlist"; \
>   dos2unix "data/langdata/NK/NK.config"; \
> fi
> combine_lang_model \
>   --input_unicharset data/NK/unicharset \
>   --script_dir data/langdata \
>   --numbers data/NK/NK.numbers \
>   --puncs data/NK/NK.punc \
>   --words data/NK/NK.wordlist \
>   --output_dir data \
>   \
>   --lang NK
> Failed to read data from data/NK/NK.wordlist
> Failed to read data from: data/NK/NK.punc
> Failed to read data from: data/NK/NK.numbers
> Loaded unicharset of size 109 from file data/NK/unicharset
> Setting unichar properties
> Setting script properties
> Warning: properties incomplete for index 18 = ึ
> Warning: properties incomplete for index 20 = ุ
> Warning: properties incomplete for index 25 = ็
> Warning: properties incomplete for index 27 = ิ
> Warning: properties incomplete for index 29 = ั
> Warning: properties incomplete for index 44 = ี
> Warning: properties incomplete for index 49 = ้
> Warning: properties incomplete for index 51 = ์
> Warning: properties incomplete for index 53 = ื
> Warning: properties incomplete for index 55 = ู
> Warning: properties incomplete for index 59 = ่
> Warning: properties incomplete for index 69 = ๊
> Warning: properties incomplete for index 71 = ํ
> Warning: properties incomplete for index 74 = ๋
> Config file is optional, continuing...
> Failed to read data from: data/langdata/NK/NK.config
> Null char=2
> Created data/NK/NK.traineddata
> lstmtraining \
>   --debug_interval 0 \
>   --traineddata data/NK/NK.traineddata \
>   --old_traineddata ../tesseract/tessdata/tha.traineddata \
>   --continue_from data/tha/NK.lstm \
>   --learning_rate 0.0001 \
>   --model_output data/NK/checkpoints/NK \
>   --train_listfile data/NK/list.train \
>   --eval_listfile data/NK/list.eval \
>   --max_iterations 400 \
>   --target_error_rate 0.01
> Loaded file data/tha/NK.lstm, unpacking...
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Code range changed from 109 to 108!
> Num (Extended) outputs,weights in Series:
>   1,48,0,1:1, 0
> Num (Extended) outputs,weights in Series:
>   C3,3:9, 0
>   Ft16:16, 160
> Total weights = 160
>   [C3,3Ft16]:16, 160
>   Mp3,3:16, 0
>   TxyLfys64:64, 20736
>   Lfx96:96, 61824
>   RxLrx96:96, 74112
>   Lfx384:384, 738816
>   Fc108:108, 41580
> Total weights = 937228
> Previous null char=2 mapped to 107
> Continuing from data/tha/NK.lstm
> Loaded 3/3 lines (1-3) of document data/NK-ground-truth/tha_47.lstmf
> Loaded 3/3 lines (1-3) of document data/NK-ground-truth/tha_2.lstmf
> Loaded 4/4 lines (1-4) of document data/NK-ground-truth/tha_126.lstmf
> Loaded 3/3 lines (1-3) of document data/NK-ground-truth/tha_177.lstmf
>
> This is the result of the training. I tried troubleshooting but can't 
> find the issue. I followed the instructions and already put the 
> radical-stroke file into the folder.
> At iteration 200/200/200, mean rms=6.488%, delta=67.908%, BCER train=78.638%, BWER train=96.847%, skip ratio=0.000%, New worst BCER = 78.638 wrote checkpoint.
> At iteration 300/300/300, mean rms=7.177%, delta=79.402%, BCER train=85.531%, BWER train=97.898%, skip ratio=0.000%, New worst BCER = 85.531 wrote checkpoint.
> At iteration 400/400/400, mean rms=6.888%, delta=71.630%, BCER train=88.148%, BWER train=98.424%, skip ratio=0.000%, New worst BCER = 88.148 wrote checkpoint.
> Finished! Selected model with minimal training error rate (BCER) = 61.707
>
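
To see the trend at a glance over a longer run, the "At iteration" lines can be parsed and compared. This is a rough sketch with hypothetical helper names; it assumes the log is plain text with the email quote markers removed, and it reads only the "BCER train" field shown in the output above.

```python
import re

# Matches one "At iteration N/N/N, ... BCER train=X%" record.
# re.S lets a record span a wrapped line; \s+ tolerates a break after "BCER".
ITERATION = re.compile(
    r"At iteration (\d+)/\d+/\d+,.*?BCER\s+train=([\d.]+)%", re.S
)


def bcer_by_iteration(log_text):
    """Return [(iteration, bcer_percent), ...] in the order logged."""
    return [(int(i), float(b)) for i, b in ITERATION.findall(log_text)]


def is_strictly_increasing(points):
    """True if BCER rises at every logged checkpoint (needs at least 2)."""
    values = [bcer for _, bcer in points]
    return len(values) > 1 and all(a < b for a, b in zip(values, values[1:]))
```

On the three checkpoints quoted above (78.638, 85.531, 88.148) the second function returns True, which matches the "New worst BCER" messages: the error rate is getting worse at each checkpoint rather than just fluctuating.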
