Re: [tesseract-ocr] OCR Output contains "xlz"

'Danny Wilson' via tesseract-ocr Sun, 15 Oct 2023 18:08:30 -0700

I guess I am the author... ARYuanB5-MD is the font.

For further background, the stock tessdata_best/chi_tra.traineddata did not do 
a good job at all on the text I'm trying to recognize.


So I retrained:
- copied the existing Chinese wordlist and added additional characters and 
sentences (47,000 lines in total)
- rendered ground truth images (with the special font) and box files
- used lang data from "chi_tra" (config, unicharset, Han.xx, Latin.xx, 
radical-stroke etc)
- ran lstmtraining with 30,000 iterations
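
Because the lang data (including the unicharset) came from "chi_tra", the 
model's character set may contain Latin letters even when the ground truth 
has none. A minimal sketch for checking that, assuming the unicharset is 
available as a text file in the usual layout (first line is the entry count, 
then one entry per line starting with the character itself). The function 
name and the file name in the comment are hypothetical:

```python
def ascii_letters_in_unicharset(lines):
    """Return single ASCII letters that appear as unicharset entries.

    Assumes the first line is the entry count and every following line
    starts with the character, followed by space-separated properties.
    """
    found = []
    for line in lines[1:]:  # skip the count line
        token = line.split(" ", 1)[0]
        # Single-character check also skips the special "NULL" entry.
        if len(token) == 1 and token.isascii() and token.isalpha():
            found.append(token)
    return found


# Hypothetical usage, assuming the file has been extracted beforehand:
# with open("ARYuanB5-MD.unicharset", encoding="utf-8") as handle:
#     print(ascii_letters_in_unicharset(handle.read().splitlines()))
```

If "x", "l" and "z" show up in the result, the model is able to emit them 
regardless of what the training text contained.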

lstmtraining completed with BCER of 0.846:

> At iteration 2689/30000/30013, mean rms=0.244%, delta=0.426%, BCER 
> train=1.425%, BWER train=3.900%, skip ratio=0.000%, New worst BCER = 1.425 
> wrote checkpoint.
> Finished! Selected model with minimal training error rate (BCER) = 0.846
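
The last progress line reports 1.425% while the selected model is at 0.846, 
so it can help to see how BCER moved over the run. A minimal sketch, assuming 
the raw lstmtraining log is saved with each progress message on one line 
(not wrapped as in this email). The function name is hypothetical, and the 
three counters in "At iteration a/b/c" are returned as-is without 
interpreting them:

```python
import re

# Matches progress lines like the one quoted above.
PROGRESS = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+),.*?BCER train=([\d.]+)%"
)


def bcer_history(log_text):
    """Return ((a, b, c), bcer_percent) for every progress line found."""
    history = []
    for match in PROGRESS.finditer(log_text):
        counters = tuple(int(match.group(i)) for i in (1, 2, 3))
        history.append((counters, float(match.group(4))))
    return history
```

Taking `min(history, key=lambda item: item[1])` then shows at which point the 
lowest training BCER was logged.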


Then I copied the output ARYuanB5-MD.traineddata to the tessdata directory.

With that traineddata, OCR is very good on the input text... except for the "對" 
character, which outputs the extra "xlz".

Neither the ground-truth nor the wordlist has "xlz" anywhere in it.  
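
For reference, a minimal sketch of that kind of check, searching 
case-insensitively so that variants such as "xlZ" are caught too. The 
function name and the file names in the comment are hypothetical:

```python
from pathlib import Path


def find_needle(paths, needle="xlz", ignore_case=True):
    """Return (path, line_number, line) for every line containing needle."""
    target = needle.lower() if ignore_case else needle
    hits = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            haystack = line.lower() if ignore_case else line
            if target in haystack:
                hits.append((str(path), number, line))
    return hits


# Hypothetical usage:
# print(find_needle(["wordlist.txt", "ground-truth.txt"]))
```

An empty list means the string does not occur in any of the given files in 
any letter case.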

Any suggestions on how to track this down?  

Thanks




> On 15 Oct 2023, at 22:20, Zdenko Podobny <zde...@gmail.com> wrote:
> 
> Seems like you should put this question to the author of the language data 
> "ARYuanB5-MD"...
> 
> Zdenko
> 
> 
> On Sun, 15 Oct 2023 at 15:44, 'Danny Wilson' via tesseract-ocr 
> <tesseract-ocr@googlegroups.com <mailto:tesseract-ocr@googlegroups.com>> 
> wrote:
>> Running tesseract on a single Chinese character "對" outputs the character, 
>> but also the text "xlz".  
>> 
>> Command line: 
>> tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c 
>> preserve_interword_spaces=1
>> 
>> The output is two lines:
>> xlz
>> 對
>> 
>> It used to output "sMz", but after retraining several times with the 
>> specific font in use, it now outputs "xlz".
>> 
>> Why?
>> 
>> I've attached the image file in question...
>> 
>> <sub0089w.png>
>> 
>> (Searching the source code, the file universalambigs.h has a line " xlZ le 
>> 1", which is similar to, but not an exact match for, the errant text I'm 
>> finding)
>> 
>> Thank you.
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-ocr+unsubscr...@googlegroups.com 
>> <mailto:tesseract-ocr+unsubscr...@googlegroups.com>.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/76ed2f78-e10f-4b9f-8d61-30f4b0f333dbn%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/76ed2f78-e10f-4b9f-8d61-30f4b0f333dbn%40googlegroups.com?utm_medium=email&utm_source=footer>.
> 
> 

