I'm trying to build a scan postprocessor that corrects or improves scanned pages (mainly text) when pages are smeared or characters were not recognized properly.
I'm using pytesseract 5.5.0.20241111 on Windows with both image_to_pdf_or_hocr (mainly to see the character recognition confidence) and image_to_boxes (to get the bbox info for each character). The characters recognized by the two methods differ in very few places per page (most often none), but image_to_boxes returns strange box coordinates in many cases. I have attached the returned boxes for "development", where all character boxes look OK, and for "adhered", "have", and "transferred", where one character box is completely off and another box spans two characters, even though each line of the box output contains a single, correctly recognized character. Any idea where this could come from or how to avoid it?
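As a possible starting point for narrowing this down, here is a minimal sketch of how the image_to_boxes output could be parsed and checked. It relies on the Tesseract box-file layout (symbol, left, bottom, right, top, page, with the origin at the bottom-left corner of the image), so the y values are flipped for drawing on a top-left-origin image. The names CharBox, parse_boxes, flag_wide_boxes and boxes_from_image are hypothetical helpers, and the 1.8x median-width threshold is an arbitrary assumption, not a Tesseract setting.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class CharBox:
    char: str
    left: int
    top: int      # measured from the top edge of the image
    right: int
    bottom: int   # measured from the top edge of the image

    @property
    def width(self) -> int:
        return self.right - self.left


def parse_boxes(box_text: str, image_height: int) -> list[CharBox]:
    """Parse box-file lines ('symbol left bottom right top page').

    Box-file coordinates use a bottom-left origin, so y is flipped here
    to match the top-left origin used by most image libraries.
    """
    boxes = []
    for line in box_text.splitlines():
        parts = line.strip().rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip empty or malformed lines
        char, x1, y1, x2, y2, _page = parts
        x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
        boxes.append(CharBox(char, x1, image_height - y2, x2, image_height - y1))
    return boxes


def flag_wide_boxes(boxes: list[CharBox], factor: float = 1.8) -> list[CharBox]:
    """Heuristic (assumption): a box much wider than the median width
    probably spans more than one character."""
    if not boxes:
        return []
    typical = median(b.width for b in boxes)
    return [b for b in boxes if b.width > factor * typical]


def boxes_from_image(path: str) -> list[CharBox]:
    """Run pytesseract on an image file and return the parsed boxes."""
    # Imported lazily so the parsing helpers work without Tesseract installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return parse_boxes(pytesseract.image_to_boxes(img), img.height)
```

Calling flag_wide_boxes(boxes_from_image("page.png")) (the file name is a placeholder) would list the boxes that look merged, which could then be compared against the per-word bboxes in the hOCR output for the same page.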

