Hello, I'm new to both Tesseract OCR and the R language. I'm using Tesseract within Alteryx to mine text from PDFs. I would like to parse the fields in the recognized text by splitting on multiple (repeated) spaces, as that seems to be the most straightforward way to tell where fields are delimited. The difficulty is that the Tesseract engine recognizes only one space where there are clearly several. Where the gaps are large, the spaces are preserved because I set preserve_interword_spaces to 1. But where the fields are closer together, Tesseract still recognizes only one space (examples attached). I'm at a loss as to which parameter(s) to change at this point, other than preserve_interword_spaces, to help Tesseract differentiate between one and multiple spaces.
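To make the intended parsing step concrete, here is a minimal sketch of splitting one line of recognized text on runs of two or more spaces. It is written in Python purely for illustration (the actual workflow is R inside Alteryx), and the sample line and the split_fields helper are hypothetical, not taken from the attached examples.

```python
import re

# Hypothetical line of OCR output: fields are separated by runs of
# two or more spaces, while words inside a field use a single space.
SAMPLE_LINE = "Invoice No   12345      Acme Supply Co   2020-07-01"


def split_fields(line, min_spaces=2):
    """Split a line into fields wherever at least `min_spaces` spaces occur in a row."""
    pattern = r" {%d,}" % min_spaces
    # Drop empty strings that could appear from leading/trailing gaps.
    return [field for field in re.split(pattern, line.strip()) if field]


fields = split_fields(SAMPLE_LINE)
print(fields)
```

With a line like this, a gap that comes back from the engine as a single space (for example "Invoice No 12345") leaves two fields merged into one, which is the problem described above.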
Any help would be much appreciated!