Tesseract 5.3.1 on macOS 13.4.1.

I have a PDF containing a scanned, single-column page from a book. The text seems to be extracted OK, but with psm 4 and 6 it can't be selected linearly in macOS Preview.app: while selecting, the selection jumps between words on different lines. Selection works well in Adobe Acrobat, though.
With psm 11, selection works well in every reader I have tried so far. But checking this is a manual and error-prone process. So my questions are:

- Should I just keep using psm 11, or is there a reason to prefer one mode over the others? Is there a deeper explanation of what each psm does?
- Is there any way to quickly diagnose what the page segmentation did? For example, it would be nice to have a debug mode where the center of each letter is connected by a line to the next letter; that way any unexpected jump in the flow would be immediately visible.
- I suspect something like that already exists, but I couldn't find anything. --loglevel prints nothing, no matter what level I select. The debug viewer description sounds like it won't help in my case. I have tried setting various config variables (textord_debug_baselines sounded promising), but for most of them I didn't see any output. Am I missing something?
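To make the second question concrete, here is a rough sketch of the kind of check I have in mind, working on words rather than letters. It assumes the page was also run with the tsv output config, and that the row order in the TSV reflects the reading order tesseract assigned; I have not verified that Preview.app relies on the same order when selecting text in the PDF. The function names and the half-a-word-height tolerance are my own choices for illustration, not anything tesseract provides.

```python
import csv
import io


def word_centers(tsv_text):
    """Return (text, cx, cy, height) per word, in the order the TSV lists them.

    Assumes the standard tesseract TSV columns (level, left, top, width,
    height, text, ...), where level 5 marks a word row.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != "5" or not text:
            continue  # skip page/block/paragraph/line rows and empty words
        left, top, width, height = (
            int(row[key]) for key in ("left", "top", "width", "height")
        )
        words.append((text, left + width / 2, top + height / 2, height))
    return words


def find_jumps(words):
    """Flag consecutive word pairs whose flow looks wrong for one column.

    Heuristic (an assumption, for left-to-right single-column text): the next
    word should sit further right on the same line, or on a lower line.
    A pair is flagged if the next word moves up by more than half a word
    height, or stays on the same line but moves left.
    """
    jumps = []
    for (t1, x1, y1, h1), (t2, x2, y2, h2) in zip(words, words[1:]):
        tolerance = 0.5 * max(h1, h2)
        moved_up = y2 < y1 - tolerance
        same_line_backwards = abs(y2 - y1) <= tolerance and x2 < x1
        if moved_up or same_line_backwards:
            jumps.append((t1, t2))
    return jumps
```

For example, after running something like "tesseract page.png out --psm 6 tsv" (page.png and out are placeholder names), reading out.tsv into a string and passing it through word_centers and then find_jumps would list the word pairs where the flow goes backwards. The same centers could be joined with lines on top of the image to get the visual version described above.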

