I guess I am the author... ARYuanB5-MD is the font. For further background, the stock tessdata_best/chi_tra.traineddata did not do a good job at all on the text I'm trying to recognize.
So I retrained:

- copied the existing Chinese wordlist and added additional characters and sentences (47,000 lines in total)
- rendered ground-truth images (with the special font) and box files
- used the lang data from "chi_tra" (config, unicharset, Han.xx, Latin.xx, radical-stroke, etc.)
- ran lstmtraining for 30,000 iterations

lstmtraining completed with a BCER of 0.846:

> At iteration 2689/30000/30013, mean rms=0.244%, delta=0.426%, BCER train=1.425%, BWER train=3.900%, skip ratio=0.000%, New worst BCER = 1.425 wrote checkpoint.
> Finished! Selected model with minimal training error rate (BCER) = 0.846

I then copied the output ARYuanB5-MD.traineddata to the tessdata directory.

With that traineddata, OCR is very good on the input text... except for the "對" character, which outputs the extra "xlz". Neither the ground truth nor the wordlist has "xlz" anywhere in it.

Any suggestions on how to track this down?

Thanks

> On 15 Oct 2023, at 22:20, Zdenko Podobny <zde...@gmail.com> wrote:
>
> Seems like you should put this question to the author of language data "ARYuanB5-MD"...
>
> Zdenko
>
> On Sun, 15 Oct 2023 at 15:44, 'Danny Wilson' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:
>> Running tesseract on a single Chinese character "對" outputs the character, but also the text "xlz".
>>
>> Command line:
>> tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c preserve_interword_spaces=1
>>
>> The output is two lines:
>> xlz
>> 對
>>
>> It used to output "sMz" but after retraining several times with the specific font in use, it now outputs "xlz".
>>
>> Why?
>>
>> I've attached the image file in question...
>>
>> <sub0089w.png>
>>
>> (Searching the source code, the file universalambigs.h has a line " xlZ le 1" which is similar, but not exact to the errant text I'm finding)
>>
>> Thank you.
>>
>> --
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/76ed2f78-e10f-4b9f-8d61-30f4b0f333dbn%40googlegroups.com.
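
One way to start tracking down where the stray "xlz" comes from, as a minimal sketch only: confirm mechanically that the string is absent from the training text, and check whether the unicharset reused from "chi_tra" still lists the individual letters, since a model can only emit symbols that are in its unicharset. The function names and any paths passed to them are hypothetical, and the parser assumes the usual text layout of a unicharset file (first line is the entry count, each later line begins with the symbol followed by a space). This does not explain the behaviour; it only narrows the search.

```python
from pathlib import Path


def unicharset_symbols(path):
    """Return the set of symbols listed in a unicharset text file.

    Assumption: line 1 is the entry count, and every later line starts
    with the symbol, followed by a space and its properties.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def files_containing(needle, paths):
    """Return the paths (as strings) whose text contains `needle`."""
    return [
        str(p)
        for p in paths
        if needle in Path(p).read_text(encoding="utf-8", errors="replace")
    ]


def stray_symbols_in_unicharset(stray_text, unicharset_path):
    """Return the characters of `stray_text` that the unicharset lists."""
    symbols = unicharset_symbols(unicharset_path)
    return sorted(ch for ch in set(stray_text) if ch in symbols)
```

For example, calling `stray_symbols_in_unicharset("xlz", ...)` on the unicharset used for training shows which of the three letters the model is able to output at all, and `files_containing("xlz", ...)` over the ground-truth .gt.txt files and the wordlist double-checks that the string really never appears there. If the letters are in the unicharset but serve no purpose for this text, rebuilding the unicharset without them would be one experiment to try.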