Sorry, I had the coordinate system flipped on my last post.
Here is a correct image, produced by text2image, that includes both FULLWIDTH
COMMA and COMMA.

For both types of comma, the boxes produced by text2image include only the
boundaries of the glyph itself and do not consider the vertical
Because of some issues with licensed fonts not working with text2image, we
wrote our own image and box file generator in Swift on the Mac.
We use that to generate a data set for 100,000 text lines and feed that into
the regular training on Linux.
Using a non-licensed font, I checked what box
For purposes of training, I'm wondering if the box for a character should
include the surrounding space.
In particular for the CJK "FULLWIDTH COMMA", should the box be the red or green
rectangle?
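
To make the two candidates concrete, here is a minimal sketch of how either rectangle would end up in a box file. It assumes the usual box-file line layout (symbol, left, bottom, right, top, page) with y measured from the bottom edge of the image, which is also why a flipped coordinate system is easy to get wrong. The Rect type, the to_box_line helper and all pixel values are hypothetical and only for illustration; the sketch does not settle which rectangle training prefers.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Rectangle in image coordinates: origin top-left, y grows downward."""
    x: int
    y: int
    width: int
    height: int


def to_box_line(symbol: str, rect: Rect, image_height: int, page: int = 0) -> str:
    """Format one box-file line: symbol left bottom right top page.

    Box files measure y from the bottom edge of the image, so a
    top-left-origin rectangle has to be flipped vertically.
    """
    left = rect.x
    right = rect.x + rect.width
    top = image_height - rect.y
    bottom = image_height - (rect.y + rect.height)
    return f"{symbol} {left} {bottom} {right} {top} {page}"


# Hypothetical numbers for a FULLWIDTH COMMA in a 48-pixel-high line image.
tight = Rect(x=4, y=30, width=8, height=12)  # glyph outline only
cell = Rect(x=0, y=0, width=48, height=48)   # full cell, including the space

print(to_box_line("，", tight, image_height=48))
print(to_box_line("，", cell, image_height=48))
```

Writing both variants for the same line image and training on each would be one way to compare them directly.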
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr"
Hi Tom,
I was hoping not to introduce heuristics before scanning the images, but it sounds
like the page segmentation in tesseract is not smart enough.
So from what you say, if the input image is:
a) "square-ish" : PSM 10 Single Character
b) approx. single-multiple of character height in given
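
A rough sketch of what such a pre-scan heuristic could look like is below. Only case a) (square-ish image, PSM 10) comes from this thread. The 1.3 squareness threshold and the fallback to PSM 7 (single text line) for everything else are my own assumptions, and choose_psm and build_command are hypothetical helper names.

```python
from typing import List


def choose_psm(width: int, height: int, squareness: float = 1.3) -> int:
    """Pick a page segmentation mode from the image's aspect ratio.

    Square-ish images are treated as a single character (PSM 10).
    Anything else falls back to PSM 7 (single text line); that
    fallback and the threshold are assumptions, not advice from the thread.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    ratio = max(width, height) / min(width, height)
    return 10 if ratio <= squareness else 7


def build_command(image: str, out_base: str, lang: str,
                  width: int, height: int, dpi: int = 72) -> List[str]:
    """Assemble a tesseract command line with the chosen --psm value."""
    psm = choose_psm(width, height)
    return ["tesseract", image, out_base, "-l", lang,
            "--dpi", str(dpi), "--psm", str(psm)]


# Example with hypothetical pixel dimensions for a single-character image.
print(" ".join(build_command("sub0089w.png", "debugOut", "ARYuanB5-MD", 36, 36)))
```

The resulting list can be passed to subprocess.run once the image dimensions are known, for example after reading them with an image library.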
The command line did not get included in my last mail. Sending again now.
$ tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c classify_debug_level=1
Processing word with lang ARYuanB5-MD at:Bounding box=(3,45)->(33,56)
Trying word using lang ARYuanB5-MD, oem 1
Best choice:
lid
> dict: 0 v 0
>
> $ cat debugOut.txt
> Ll
> 對
> On 16 Oct 2023, at 09:08, 'Danny Wilson' via tesseract-ocr
> wrote:
>
> I guess I am the author... ARYuanB5-MD is the font.
>
> For further background, the stock tessdata_best/chi_tra.traineddata did not
>
23, at 22:20, Zdenko Podobny wrote:
>
> Seems like you should put this question to the author of the language data
> "ARYuanB5-MD"...
>
> Zdenko
>
>
> On Sun, 15 Oct 2023 at 15:44, 'Danny Wilson' via tesseract-ocr
> <tesseract-ocr@googlegroups.com> wrote:
Running tesseract on a single Chinese character "對" outputs the character,
but also the text "xlz".
Command line:
tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c preserve_interword_spaces=1
The output is two lines:
xlz
對
It used to output "sMz" but after retraining