Re: [tesseract-ocr] Should box include surrounding space?

2023-10-19 Thread 'Danny Wilson' via tesseract-ocr
Sorry, I had the coordinate system flipped in my last post. Here is a correct image, produced by text2image, that includes both FULLWIDTH COMMA and COMMA. For both types of comma, the boxes produced by text2image include only the boundaries of the glyph itself and do not consider the vertical
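
Since a flipped coordinate system came up here, a minimal sketch of the conversion may help. Tesseract box files use a bottom-left origin and list each symbol as "symbol left bottom right top page", whereas most drawing APIs report rectangles from a top-left origin. The function name `to_box_line` and the sample numbers are assumptions for illustration only; this is not the Swift generator mentioned in the thread.

```python
def to_box_line(char, x, y, w, h, image_height, page=0):
    """Convert a top-left-origin rectangle into a Tesseract box-file line.

    (x, y) is the top-left corner of the tight glyph rectangle, measured
    from the top-left of the image; w and h are its width and height.
    """
    left = x
    right = x + w
    # Box files measure y upward from the bottom edge of the image.
    top = image_height - y
    bottom = image_height - (y + h)
    return f"{char} {left} {bottom} {right} {top} {page}"


# Hypothetical comma glyph in a 64-pixel-high line image.
print(to_box_line(",", 10, 40, 6, 12, image_height=64))
```

The rectangle passed in here is the tight glyph bounds, matching what the post reports for text2image output.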

Re: [tesseract-ocr] Should box include surrounding space?

2023-10-18 Thread 'Danny Wilson' via tesseract-ocr
Because of some issues with licensed fonts not working with text2image, we wrote our own image and box file generator in Swift on the Mac. We use that to generate a data set for 100,000 text lines and feed that into the regular training on Linux. Using a non-licensed font, I checked what box

[tesseract-ocr] Should box include surrounding space?

2023-10-17 Thread 'Danny Wilson' via tesseract-ocr
For purposes of training, I'm wondering if the box for a character should include the surrounding space. In particular for the CJK "FULLWIDTH COMMA", should the box be the red or green rectangle?

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-16 Thread 'Danny Wilson' via tesseract-ocr
Hi Tom, I was hoping not to introduce heuristics before scanning the images, but it sounds like the page segmentation in tesseract is not smart enough. So from what you say, if the input image is:
a) "square-ish": PSM 10 Single Character
b) approx. single-multiple of character height in given
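
A rough sketch of the kind of pre-scan heuristic discussed here, covering only case a), since the rest of the list is cut off in this excerpt. The function name `choose_psm`, the `tolerance` threshold, and the fallback to PSM 6 (the mode used in the command lines elsewhere in this thread) are assumptions, not rules stated by either poster.

```python
def choose_psm(width, height, tolerance=0.2):
    """Pick a Tesseract page segmentation mode from the image shape.

    Returns 10 (single character) for "square-ish" images and falls back
    to 6 (single uniform block of text) otherwise. The tolerance value
    and the fallback are hypothetical choices for illustration.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    aspect = width / height
    if abs(aspect - 1.0) <= tolerance:
        return 10
    return 6


print(choose_psm(30, 32))   # nearly square crop
print(choose_psm(300, 32))  # wide text line
```

The returned value would be passed to the command line as `--psm <value>`.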

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-16 Thread 'Danny Wilson' via tesseract-ocr
The command line did not get included in my last mail. Sending again now.
$ tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c classify_debug_level=1
Processing word with lang ARYuanB5-MD at:Bounding box=(3,45)->(33,56)
Trying word using lang ARYuanB5-MD, oem 1
Best choice:

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-16 Thread 'Danny Wilson' via tesseract-ocr
lid > dict: 0 v 0 > > $ cat debugOut.txt > Ll > 對 > On 16 Oct 2023, at 09:08, 'Danny Wilson' via tesseract-ocr > wrote: > > I guess I am the author... ARYuanB5-MD is the font. > > For further background, the stock tessdata_best/chi_tra.traineddata did not >

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread 'Danny Wilson' via tesseract-ocr
23, at 22:20, Zdenko Podobny wrote: > > Seam like you should put this question to the author of language data > "ARYuanB5-MD"... > > Zdenko > > > ne 15. 10. 2023 o 15:44 'Danny Wilson' via tesseract-ocr > mailto:tesseract-ocr@googlegroups.com

[tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread 'Danny Wilson' via tesseract-ocr
Running tesseract on a single Chinese character "對" outputs the character, but also the text "xlz".
Command line:
tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c preserve_interword_spaces=1
The output is two lines:
xlz
對
It used to output "sMz" but after retraining