Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread 'Danny Wilson' via tesseract-ocr
I guess I am the author... ARYuanB5-MD is the font. For further background, the stock tessdata_best/chi_tra.traineddata did not do a good job at all on the text I'm trying to recognize. So I retrained: - copying the existing Chinese wordlist and adding additional characters and sentences

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread Zdenko Podobny
Seems like you should put this question to the author of the language data "ARYuanB5-MD"... Zdenko On Sun, 15 Oct 2023 at 15:44, 'Danny Wilson' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote: > Running tesseract on a single Chinese character "對" outputs the character, > but also the text

Re: [tesseract-ocr] "Leptonica was build without TIFF support! Disabling TIFF support..."

2023-10-15 Thread Zdenko Podobny
Honestly, this is a very messy configuration for me. Why? Tesseract (and other projects) use CMake to avoid such manual settings. Just follow the example in our GitHub action for cmake[1] - it is dead simple and it works. CMake takes care of correct linking (debug/release), and build (no need

[tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread 'Danny Wilson' via tesseract-ocr
Running tesseract on a single Chinese character "對" outputs the character, but also the text "xlz". Command line:
tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c preserve_interword_spaces=1
The output is two lines:
xlz
對
It used to output "sMz" but after retraining
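For anyone who wants to reproduce this kind of report over many images, here is a minimal Python sketch. It only rebuilds the command line quoted above and flags output lines other than the expected character. The helper names `build_tesseract_cmd` and `unexpected_lines` are hypothetical, not part of Tesseract.

```python
def build_tesseract_cmd(image, out_base, lang, dpi=72, psm=6):
    """Return the argv list for the tesseract call quoted in the post."""
    return [
        "tesseract", image, out_base,
        "-l", lang,
        "--dpi", str(dpi),
        "--psm", str(psm),
        "-c", "preserve_interword_spaces=1",
    ]


def unexpected_lines(ocr_text, expected):
    """Return the non-empty output lines that differ from the expected text."""
    stripped = (line.strip() for line in ocr_text.splitlines())
    return [line for line in stripped if line and line != expected]


# Arguments taken from the post; the traineddata must be installed locally.
cmd = build_tesseract_cmd("sub0089w.png", "debugOut", "ARYuanB5-MD")
# Sample output text copied from the post, not produced by running tesseract here.
extra = unexpected_lines("xlz\n對\n", "對")
```

The `cmd` list can be passed to `subprocess.run(cmd, check=True)`, after which the text lands in `debugOut.txt`.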

[tesseract-ocr] Re: Armenian.traineddata hye language tesseract

2023-10-15 Thread Des Bw
Check the conversation in this forum where Schree trained the Norwegian data to include the missing letter Æ. I used this method to train for Amharic, and it worked for me. Basically, the method is to cut off the top layer of the network and train from there. Fine tuning doesn't work for adding
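"Cutting off the top layer" appears to correspond to the `--append_index` and `--net_spec` options of `lstmtraining`, as described in the Tesseract training documentation. That mapping is an assumption, as the post does not name the options. The Python sketch below only assembles such a command. The helper name `build_replace_top_layer_cmd`, all paths, the index `5`, and the net spec are placeholders: the output size in the spec (`O1c111` here) has to match your own unicharset.

```python
def build_replace_top_layer_cmd(base_lstm, traineddata, train_list,
                                model_output, append_index, net_spec,
                                max_iterations):
    """Assemble an lstmtraining argv that keeps layers below append_index
    and appends the new layers described by net_spec."""
    return [
        "lstmtraining",
        "--continue_from", base_lstm,
        "--traineddata", traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", model_output,
        "--train_listfile", train_list,
        "--max_iterations", str(max_iterations),
    ]


# All values below are hypothetical placeholders, not taken from the thread.
layer_cmd = build_replace_top_layer_cmd(
    base_lstm="base/amh.lstm",
    traineddata="train/amh/amh.traineddata",
    train_list="train/amh.training_files.txt",
    model_output="output/amh_layer",
    append_index=5,
    net_spec="[Lfx256 O1c111]",
    max_iterations=3600,
)
```

The base `.lstm` file would first be extracted from an existing traineddata with `combine_tessdata -e`.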