Hi Tom,
I was hoping not to introduce heuristics before scanning the images, but it
sounds like the page segmentation in Tesseract is not smart enough.
So from what you say, if the input image is:
a) "square-ish" : PSM 10 Single Character
b) approx. single-multiple of character height in given
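
A rough sketch of how such a pre-scan heuristic might look in shell. The
function name choose_psm, the 1.3 aspect-ratio tolerance, and the fallback to
PSM 6 are all assumptions made for illustration, not values taken from this
thread; only the "square-ish means PSM 10" rule comes from the message above.

```shell
# Hypothetical helper: pick a page segmentation mode from image dimensions.
# Usage: choose_psm WIDTH HEIGHT   (both in pixels, integers)
choose_psm() {
    width=$1
    height=$2
    # "Square-ish": neither side is more than 1.3x the other (assumed tolerance).
    # Integer arithmetic only, so the ratio test is written as w*10 <= h*13.
    if [ $((width * 10)) -le $((height * 13)) ] &&
       [ $((height * 10)) -le $((width * 13)) ]; then
        echo 10   # PSM 10: treat the image as a single character
    else
        echo 6    # assumed fallback: single uniform block of text
    fi
}
```

If ImageMagick happens to be installed, the dimensions could be read with
identify -format "%w %h" sub0089w.png and passed to the function, with the
result handed to tesseract via --psm.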
On Monday, October 16, 2023 at 3:34:39 AM UTC-4 Danny wrote:
This raises a new issue: the input data (TV subtitles) are a mixture of 1-
or 2-line text blocks. And a 1-line text block might be a single character
in this case.
So the ideal page segmentation mode should be 6, no? But looking at
The command line did not get included in my last mail. Sending again now.
$ tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c classify_debug_level=1
Processing word with lang ARYuanB5-MD at:Bounding box=(3,45)->(33,56)
Trying word using lang ARYuanB5-MD, oem 1
Best choice:
After running tesseract with various debug switches activated, I've found that
it thinks there are two characters in the image and tries OCR on each of them.
Changing the page segmentation mode changes the output:
PSM 6 (single uniform block of text) : outputs garbage plus correct character
PSM
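
For anyone wanting to repeat this comparison, a loop along these lines might
help. It is only a sketch: the function name psm_sweep and the choice of modes
(6, 7, 8, 10) are assumptions, and the language and DPI are copied from the
command line posted earlier in this thread. Output goes to the terminal by
using "stdout" as the output base.

```shell
# Hypothetical helper: run one image through several page segmentation modes.
# Usage: psm_sweep IMAGE LANG
# Set TESSERACT=echo to preview the commands instead of running them.
psm_sweep() {
    image=$1
    lang=$2
    # 6 = single uniform block, 7 = single text line,
    # 8 = single word, 10 = single character
    for psm in 6 7 8 10; do
        echo "== PSM $psm =="
        "${TESSERACT:-tesseract}" "$image" stdout -l "$lang" --dpi 72 --psm "$psm"
    done
}
```

For example: psm_sweep sub0089w.png ARYuanB5-MD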