tesseract 5.3.1 on macOS 13.4.1

I have a PDF containing a single-column scanned page from a book. The text 
seems to be extracted correctly, but with psm 4 and 6 it can't be selected 
linearly in macOS's Preview.app: while dragging, the selection jumps 
between words on different lines. Selection works well in Adobe Acrobat, 
though.

With psm 11, selection works well in every reader I have tried so far, 
but checking this is a manual and error-prone process.

So my questions are:

   - Should I just keep using psm 11, or is there a reason to prefer one 
   of the others? Is there a deeper explanation of what each psm does?
   - Is there any way to quickly diagnose what the page segmentation 
   did? For example, it would be nice to have a debug mode where the center 
   of each letter is connected by a line to the next letter; that way any 
   unexpected jump in the flow would be immediately visible.
   - I suspect something like that already exists, but I couldn't find 
   it. --loglevel prints nothing, no matter which level I select. The 
   debug viewer description sounds like it won't help in my case. I have 
   tried setting various config variables (textord_debug_baselines 
   sounded promising), but for most of them I didn't see any output. Am I 
   missing something?
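
For illustration, the "connect each element to the next" check from the 
second question can be approximated outside of tesseract. The following is 
only a minimal sketch, not an existing tesseract debug mode: it works on 
words rather than letters, and it assumes the page was also run with the 
tsv output config (for example "tesseract page.png out --psm 6 tsv", where 
page.png and out are placeholder names). The helper names word_centers and 
find_jumps and the half-a-word-height threshold are hypothetical choices. 
It also assumes that the row order in the TSV matches the order in which 
words are written to the PDF text layer, which I have not verified.

```python
import csv
import io


def word_centers(tsv_text):
    """Return (text, center_x, center_y, height) per word, in TSV row order.

    Expects tesseract's TSV output, where rows with level == 5 are words.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row["level"] != "5" or not text:
            continue  # skip page/block/paragraph/line rows and empty words
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        words.append((text, left + width / 2, top + height / 2, height))
    return words


def find_jumps(words):
    """Return (previous, next) word pairs where the flow moves back up the page.

    Heuristic (an assumption, not tesseract behaviour): a jump is flagged
    when the next word's center lies more than half a word height above
    the previous word's center.
    """
    jumps = []
    for (t1, _, y1, h1), (t2, _, y2, h2) in zip(words, words[1:]):
        if y2 < y1 - 0.5 * max(h1, h2):
            jumps.append((t1, t2))
    return jumps
```

To use it, read the generated out.tsv as UTF-8 text, pass the contents to 
word_centers, and pass that result to find_jumps. An empty list suggests 
the words were emitted top to bottom; any pairs returned mark places where 
the order moves back up the page. This only checks vertical order, so it 
will not catch a multi-column page read straight across.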




