Not sure if I'm being helpful, but it sounds like either your input
image is noisy or the thresholding algorithm incorrectly separated
foreground from background. If it's the former, noise reduction on the
original image would help. If it's the latter, you probably need to
choose a thresholding algorithm more appropriate for your input image.
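
As a rough illustration of the noise-reduction route, here is a minimal
sketch that erases tiny specks from an image that has already been
binarized, before it is handed to Tesseract. It is plain Python and not
part of Tesseract; the names remove_small_components and min_pixels are
made up for this example, and it assumes the image is a list of rows of
0/1 values with 1 meaning ink.

```python
from collections import deque


def remove_small_components(img, min_pixels):
    """Return a copy of a binary image (rows of 0/1, 1 = ink) in which
    every 4-connected ink component with fewer than min_pixels pixels
    has been erased. The input image is left unchanged."""
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]

    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and img[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Components below the threshold are treated as noise.
            if len(component) < min_pixels:
                for cy, cx in component:
                    out[cy][cx] = 0
    return out
```

The threshold would have to stay below the pixel count of the smallest
real marks (periods, dots on an "i"), otherwise those get erased too,
so this probably only works when the specks are clearly smaller than
the punctuation.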

That said, I don't know how to suppress small rows efficiently.

Andrei

On Jan 17, 11:55 am, patrickq <[email protected]> wrote:
> I am scanning images with large, clear text but on a grainy background
> and although I get the text fine, I also get myriads of irrelevant
> letters with a size of 3 or 5 pixels (way below a size at which
> anything could be recognized accurately). I could eliminate them based
> on size post-OCR but meanwhile Tesseract spent minutes recognizing
> these characters. Could someone please point me to the right
> variable(s) to tell Tesseract to not attempt recognition (and ideally not
> return boxes at the layout analysis phase) below a certain size?
>
> I assume that the variable in question regards the min expected height
> of a row (rather than of individual characters) since a dot ('.') for
> example can be quite small even within a row with normal sized
> letters.
>
> Thanks!