[tesseract-ocr] Strange bbox'es in image_to_boxes

Jürgen Uhl Fri, 12 Dec 2025 06:54:03 -0800

I'm trying to build a scan postprocessor that corrects or improves scanned 
pages (mainly text) when pages are smeared or characters were not properly 
recognized.


I'm using pytesseract 5.5.0.20241111 on Windows with both 
image_to_pdf_or_hocr (mainly to see the character recognition 
confidence) and image_to_boxes to get the bounding box of each character. 
The two methods agree almost everywhere (very few deviations per page, most 
often zero), but image_to_boxes returns strange boxes in many cases. I have 
attached the returned boxes for "development", where all character boxes look 
OK, and for "adhered", "have", and "transferred", where one character box is 
completely off and another box spans two characters, even though each box 
line contains a single, correctly recognized character.
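
For reference, here is a minimal sketch of how the per-character boxes can be 
read and put into the same coordinate system as the hOCR output. It relies on 
the box-file format "char left bottom right top page", whose origin is the 
bottom-left corner of the image, whereas hOCR bbox values are measured from 
the top-left corner. The names CharBox, parse_boxes, boxes_for_image and 
suspicious_boxes are hypothetical helpers, and the width threshold is an 
arbitrary assumption, not a Tesseract setting.

```python
from statistics import median
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    top: int      # distance from the top edge of the image
    right: int
    bottom: int   # distance from the top edge of the image


def parse_boxes(box_text: str, image_height: int) -> List[CharBox]:
    """Parse box-file text and flip y so the origin is top-left, as in hOCR."""
    boxes = []
    for line in box_text.splitlines():
        # Split from the right: the last five fields are always numbers.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, _page = parts
        boxes.append(CharBox(char,
                             int(left),
                             image_height - int(top),
                             int(right),
                             image_height - int(bottom)))
    return boxes


def suspicious_boxes(boxes: List[CharBox], factor: float = 1.8) -> List[CharBox]:
    """Return boxes much wider than the median width (assumed heuristic)."""
    if not boxes:
        return []
    typical = median(b.right - b.left for b in boxes)
    return [b for b in boxes if (b.right - b.left) > factor * typical]


def boxes_for_image(path: str) -> List[CharBox]:
    """Run image_to_boxes on a file; needs Pillow and pytesseract installed."""
    from PIL import Image
    import pytesseract

    image = Image.open(path)
    return parse_boxes(pytesseract.image_to_boxes(image), image.height)
```

Calling suspicious_boxes(boxes_for_image("page.png")) (the file name is a 
placeholder) lists the boxes that are unusually wide, which is one way to 
find candidates like the boxes that span two characters.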

Any idea where this could come from, or how to avoid it?

