Sorry, I had the coordinate system flipped in my last post. Here is a correct image, produced by text2image, that includes both FULLWIDTH COMMA and COMMA.
For both types of comma, the boxes produced by text2image cover only the bounds of the glyph itself and do not reflect its vertical position on the line. I've trained using this type of ground truth, but when running the OCR the Latin COMMA is always output instead of the correct FULLWIDTH COMMA. If the box only surrounds the glyph exactly, then I fear that no amount of training will enable the model to differentiate between the two types of comma.

Is there a way to tune the training process? Alternatively, I could render the boxes for some special characters so that they extend from the text baseline. That would differentiate mid-line commas from baseline commas, but it still would not help with fonts that place both the fullwidth and the normal comma on the baseline.

Does anyone have experience with this?

Thanks,
Danny
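
P.S. To make the box-extension idea concrete, here is a rough sketch of the post-processing I have in mind. It is only an illustration, and I have not verified that it helps training. It assumes the usual box-file layout of `<symbol> <left> <bottom> <right> <top> <page>` with the origin at the bottom-left, that all the lines passed in belong to a single row of text, and that the baseline can be approximated by the median bottom edge of the other glyphs. The names `SPECIAL`, `parse_box` and `extend_to_baseline` are my own and not part of Tesseract.

```python
from statistics import median_low

# Symbols whose boxes get stretched down to the baseline (illustrative choice).
SPECIAL = {",", "\uFF0C"}  # COMMA, FULLWIDTH COMMA


def parse_box(line):
    """Split one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    symbol, *nums = line.rstrip("\n").rsplit(" ", 5)
    return [symbol] + [int(n) for n in nums]


def extend_to_baseline(lines, special=SPECIAL):
    """Return box lines with the bottom edge of special symbols lowered to the baseline.

    Assumption: every line belongs to the same row of text. The baseline is
    estimated crudely as the (low) median bottom edge of the non-special glyphs.
    """
    boxes = [parse_box(l) for l in lines if l.strip()]
    bottoms = [b[2] for b in boxes if b[0] not in special]
    if not bottoms:
        # Nothing to estimate the baseline from; leave the boxes unchanged.
        return [" ".join(map(str, b)) for b in boxes]
    baseline = median_low(bottoms)
    out = []
    for symbol, left, bottom, right, top, page in boxes:
        if symbol in special:
            # y grows upward, so a lower edge means a smaller value.
            bottom = min(bottom, baseline)
        out.append(f"{symbol} {left} {bottom} {right} {top} {page}")
    return out
```

With this, a mid-line comma ends up with a tall box reaching down to the baseline, while a comma that already sits on or below the baseline keeps its original box. As noted above, that still cannot separate the two commas in fonts that draw both on the baseline.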