[tesseract-ocr] OCR recognize multiple spaces (aware of: preserve_interword_spaces)

W Sweigart Thu, 13 Aug 2020 09:18:21 -0700

Hello,

I'm new to both Tesseract OCR and the R language. I'm using Tesseract 
within Alteryx to extract text from PDFs. I would like to split the 
recognized text into fields on runs of multiple spaces, as that seems to be 
the most straightforward way to tell where fields are delimited. The 
difficulty is that the Tesseract engine outputs only one space where there 
are clearly several. Because I set preserve_interword_spaces to 1, the 
spaces are preserved where the gaps are large. But where the fields are 
closer together, Tesseract still outputs only one space (examples 
attached). I'm at a loss as to which parameter(s), other than 
preserve_interword_spaces, to change at this point to help Tesseract 
differentiate between one and multiple spaces.
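
The parsing step described above can be sketched roughly as follows. This 
is only an illustration in Python, not the actual Alteryx/R workflow: the 
helper name split_fields and the sample lines are hypothetical, and it 
assumes the OCR output is already available as plain text lines. It shows 
why a gap collapsed to a single space merges two fields into one.

```python
import re

# Assumption: two or more consecutive spaces mark a field boundary.
FIELD_GAP = re.compile(r" {2,}")

def split_fields(line):
    """Split one line of OCR text into fields on runs of 2+ spaces."""
    return [field for field in FIELD_GAP.split(line.strip()) if field]

# Hypothetical OCR output lines, made up for illustration.
# In the first line every gap survived as multiple spaces.
sample_wide = "Invoice No    12345    Total Due    98.76"
# In the second line the narrower gaps were collapsed to one space.
sample_collapsed = "Invoice No 12345    Total Due 98.76"

print(split_fields(sample_wide))
# ['Invoice No', '12345', 'Total Due', '98.76']
print(split_fields(sample_collapsed))
# ['Invoice No 12345', 'Total Due 98.76']
```

With the second line, the label and its value can no longer be told apart 
by spacing alone, which is the problem in the attached examples.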


Any help would be much appreciated!

