Zdravko,

You should run text detection before passing images to Tesseract.
Text detection is the process of determining which image regions
contain text. Even if an image contains no text, Tesseract will still
treat it as an image of text.

Before recognition Tess applies a so-called binarization algorithm,
which converts an RGB image to a monochrome one (black for text and
white for background). For your sample image, the Otsu binarization
used in Tesseract (http://en.wikipedia.org/wiki/Otsu%27s_method) would
certainly produce a number of skewed vertical lines resembling
backslashes, and the subsequent recognition classifies them as such.
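
To illustrate why an almost blank image still ends up with "ink" after
binarization, here is a minimal sketch of textbook Otsu thresholding in
plain Python. It is not Tesseract's actual code; the function names and
the flat list of 0..255 gray values are my assumptions for the example.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels <= t form the dark class, pixels > t the light class.
    Textbook Otsu on a flat list of ints; not Tesseract's implementation.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # light class empty, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, t):
    """Map pixels <= t to black (0) and the rest to white (255)."""
    return [0 if p <= t else 255 for p in pixels]


# A nearly blank image: mostly 250, with a little faint noise at 240.
almost_blank = [250] * 90 + [240] * 10
t = otsu_threshold(almost_blank)
black_count = binarize(almost_blank, t).count(0)
```

The threshold is always chosen relative to the image's own histogram, so
the faint noise becomes solid black even though the absolute difference
is only 10 gray levels.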

"textord_heavy_nr" and some other variables control size-based noise
removal but work satisfactorily only when there is a significant
body of good text surrounded by some amount of noise. In your image
everything is noise, so it won't work.

Therefore you need to extend your pre-processing so that Tess is fed
only images that actually contain text. The decision can be based on
contrast estimation, distinctive color distribution, etc.
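
As a rough sketch of the contrast-estimation idea, the check below
rejects an image when the spread between its 5th and 95th percentile
gray values is small. The helper names, the percentile choice and the
min_spread value of 60 are all hypothetical and would need tuning on
your own images.

```python
def percentile(sorted_vals, q):
    """Nearest-rank percentile of a pre-sorted, non-empty list; q in [0, 100]."""
    idx = round(q / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]


def has_enough_contrast(pixels, min_spread=60):
    """Return True if the gray values are spread widely enough to plausibly
    contain text. min_spread is an assumed, untuned cutoff."""
    if not pixels:
        return False
    vals = sorted(pixels)
    # Percentiles instead of min/max so a few outlier pixels do not decide.
    spread = percentile(vals, 95) - percentile(vals, 5)
    return spread >= min_spread


almost_blank = [250] * 90 + [240] * 10   # faint noise on a light background
text_like = [20] * 30 + [230] * 70       # dark strokes on a light background

skip_ocr = not has_enough_contrast(almost_blank)
run_ocr = has_enough_contrast(text_like)
```

Only images that pass such a check would be handed to Tesseract; the
rest can be reported as containing no text.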

HTH

Warm regards,
Dmitry Silaev





On Fri, Mar 4, 2011 at 5:25 PM, zdravco <zdra...@gmail.com> wrote:
> Hello,
>
> I am using tesseract in my project after some image pre-processing.
> There are some false negatives I was hoping tesseract would eliminate
> by producing no output. However, sometimes there is a strange output
> that I get from almost blank images.
> Here is the sample image:
> https://picasaweb.google.com/zdravco/TesseractTest#5580227257541654274
>
> When I run it with tesseract rev. 552 using English language I get:
> " \\\\ R \."
>
> Does anyone know if there are some options in tesseract that could
> eliminate this noise? Or maybe if I could improve my input image with
> some further pre-processing. I have also tried to recompile tesseract
> with "textord_heavy_nr" set to TRUE, but then the output is:
> "an \\“ R \".
>
> Thanks,
> Zdravko
>
> --
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> To unsubscribe from this group, send email to 
> tesseract-ocr+unsubscr...@googlegroups.com.
> For more options, visit this group at 
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>
