Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-26 Thread Tom Morris
On Monday, October 16, 2023 at 8:46:50 PM UTC-4 Danny wrote:
> For your reference, closed captions used in the US, Canada, and Korea are text based. DVB Subtitles, used in the rest of the world, are bitmap pictures.
Good to know. I guess that's what happens when the standards bodies optimize for

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-16 Thread 'Danny Wilson' via tesseract-ocr
Hi Tom, I was hoping not to introduce heuristics before scanning the images, but it sounds like the page segmentation in tesseract is not smart enough. So from what you say, if the input image is:
a) "square-ish": PSM 10 (Single Character)
b) approx. single-multiple of character height in given
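A minimal sketch of the kind of pre-scan heuristic discussed here, assuming the image dimensions are already known. Only case (a) is modelled, because case (b) is cut off in the archive; everything that is not "square-ish" falls back to a caller-supplied mode (PSM 6 by default, the mode used elsewhere in the thread). The function name `choose_psm` and the `squareness_tolerance` threshold are hypothetical, not something from tesseract or from the thread.

```python
def choose_psm(width, height, default_psm=6, squareness_tolerance=0.25):
    """Pick a tesseract page segmentation mode from the image shape.

    Assumption: an image whose width/height ratio is within
    `squareness_tolerance` of 1.0 is treated as "square-ish" and is
    assumed to hold a single character (PSM 10). Anything else gets
    `default_psm`. The tolerance is an illustrative value only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    ratio = width / height
    if abs(ratio - 1.0) <= squareness_tolerance:
        return 10  # PSM 10: treat the image as a single character
    return default_psm


if __name__ == "__main__":
    # Hypothetical sizes: a near-square glyph crop and a wide subtitle line.
    print(choose_psm(36, 30))
    print(choose_psm(200, 30))
```

The tolerance would need tuning against real subtitle crops; a narrow character or a tight crop could easily fall outside it.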

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-16 Thread Tom Morris
On Monday, October 16, 2023 at 3:34:39 AM UTC-4 Danny wrote: This raises a new issue: the input data (TV subtitles) are a mixture of 1 or 2 line text blocks. And a 1-line text block might be a single character in this case. So the ideal page segmentation mode should be 6, no? But looking at

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-16 Thread 'Danny Wilson' via tesseract-ocr
The command line did not get included in my last mail. Sending again now.
$ tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c classify_debug_level=1
Processing word with lang ARYuanB5-MD at:Bounding box=(3,45)->(33,56)
Trying word using lang ARYuanB5-MD, oem 1
Best choice:

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-16 Thread 'Danny Wilson' via tesseract-ocr
After running tesseract with various debug switches activated, I've found that it thinks there are two characters in the image and is trying OCR on each of them. Changing the page segmentation mode changes the output:
PSM 6 (single uniform block of text): outputs garbage plus the correct character
PSM
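To compare page segmentation modes side by side, the command line can be generated once per mode. The sketch below only builds the argument lists; it uses the flags that appear in this thread (`-l`, `--dpi`, `--psm`, `-c`), while the helper names and the per-mode output base naming are assumptions made for illustration.

```python
def build_tesseract_command(image, output_base, lang, psm, dpi=72, config=None):
    """Return the tesseract argument list for one page segmentation mode."""
    cmd = [
        "tesseract", image, output_base,
        "-l", lang,
        "--dpi", str(dpi),
        "--psm", str(psm),
    ]
    # Each config variable is passed as its own "-c name=value" pair.
    for name, value in (config or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


def commands_for_modes(image, lang, modes=(6, 10), config=None):
    """Map each PSM to a command that writes to its own output base.

    Assumption: the output base is suffixed with the mode (for example
    "debugOut_psm6") so that runs do not overwrite each other.
    """
    return {
        psm: build_tesseract_command(
            image, f"debugOut_psm{psm}", lang, psm, config=config
        )
        for psm in modes
    }
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where tesseract and the traineddata file are installed.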

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread 'Danny Wilson' via tesseract-ocr
I guess I am the author... ARYuanB5-MD is the font. For further background, the stock tessdata_best/chi_tra.traineddata did not do a good job at all on the text I'm trying to recognize. So I retrained:
- copying the existing Chinese wordlist and adding additional characters and sentences

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread Zdenko Podobny
Seems like you should put this question to the author of the language data "ARYuanB5-MD"...
Zdenko
On Sun, Oct 15, 2023 at 15:44, 'Danny Wilson' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:
> Running tesseract on a single Chinese character "對" outputs the character,
> but also the text

[tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread 'Danny Wilson' via tesseract-ocr
Running tesseract on a single Chinese character "對" outputs the character, but also the text "xlz". Command line:
tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c preserve_interword_spaces=1
The output is two lines:
xlz
對
It used to output "sMz" but after retraining