I guess I am the author... ARYuanB5-MD is the font.
For further background, the stock tessdata_best/chi_tra.traineddata did not do
a good job at all on the text I'm trying to recognize.
So I retrained:
- copied the existing Chinese wordlist and added additional characters and
sentences
Seems like you should put this question to the author of the language data
"ARYuanB5-MD"...
Zdenko
On Sun, 15 Oct 2023 at 15:44, 'Danny Wilson' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:
> Running tesseract on a single Chinese character "對" outputs the character,
> but also the text
Honestly, this looks like a very messy configuration to me. Why? Tesseract (and
other projects) use CMake to avoid such manual settings.
Just follow the example in our GitHub action for cmake[1] - it is
stupidly simple and it works. CMake takes care of correct linking
(debug/release), and build (no need
Running tesseract on a single Chinese character "對" outputs the character,
but also the text "xlz".
Command line:
tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c
preserve_interword_spaces=1
The output is two lines:
xlz
對
It used to output "sMz" but after retraining
Check the conversation in this forum where Schree trained the Norwegian
data to include the missing letter Æ. I used this method to train for
Amharic, and it worked for me.
Basically, the method is to cut off the top layer of the network and train
from there.
Fine tuning doesn't work for adding
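
A rough sketch of what "cut off the top layer and train from there" could look like with lstmtraining, written as a small Python helper that only assembles and prints the command. The flags (--continue_from, --traineddata, --append_index, --net_spec, --train_listfile, --model_output, --max_iterations) are real lstmtraining options, but every path below is hypothetical, and the append index and net spec are illustrative values taken from the style of the Tesseract training documentation, not from this thread. They would have to be matched to the network of the model actually being retrained.

```python
import shlex


def build_replace_top_layer_cmd(
    continue_from,        # LSTM model extracted from an existing traineddata
    traineddata,          # starter traineddata that carries the new unicharset
    train_listfile,       # text file listing the .lstmf training files
    model_output,         # path prefix for the checkpoints written during training
    append_index=5,       # ASSUMPTION: layers above this index are cut off
    net_spec="[Lfx256 O1c111]",  # ASSUMPTION: spec of the replacement top layers
    max_iterations=3600,  # ASSUMPTION: arbitrary illustrative value
):
    """Return the lstmtraining argument list for replacing the top layer."""
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


# All paths here are hypothetical placeholders.
cmd = build_replace_top_layer_cmd(
    continue_from="extracted/chi_tra.lstm",
    traineddata="train/chi_tra/chi_tra.traineddata",
    train_listfile="train/chi_tra.training_files.txt",
    model_output="output/top_layer",
)

# Print a copy-pasteable shell command instead of running it.
print(shlex.join(cmd))
```

The starting .lstm file would normally be extracted from the existing traineddata first (for example with combine_tessdata -e), and the printed command is only run once the paths and the net spec have been checked against the real model.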