Re: Tesseract filling supposedly missing character pixels – how to suppress that behavior?

Dmitri Silaev Tue, 16 Oct 2012 02:29:16 -0700

Andres,

Having the training images is good, but having sample images as well
is even better. Please provide the ones you get straight from the
device together with the binarized versions you pass to Tess, so that
the forum can give you a more meaningful response.

Confusion among 3/6/8/9 is a common problem even with limited
character sets. Sometimes 5 gets mixed in as well. The same can happen
with 0/D/Q/O, and for some fonts with 2/Z, etc. Remedies differ and
are sometimes mutually exclusive.

Tess doesn't fill gaps; it matches the contours of connected
components, the unknown's against the prototype's. The procedure is
designed with broken characters in mind, so it also has a way to
account for missing contour parts. See Tesseract's presentation from
OSCON for more detail. You'll need this to understand which config
vars to tune.
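
To illustrate why a matcher that tolerates missing contour parts can
confuse a '3' with a '9', here is a toy sketch. It is not Tesseract's
actual algorithm: the segment labels, the two prototypes and both
scoring functions are made up for illustration only.

```python
# Toy model only: each glyph is a set of hypothetical contour-segment
# labels. Real Tesseract features and prototypes look nothing like this.
PROTOTYPES = {
    "3": {"top_arc", "mid_bar", "bottom_arc"},
    "9": {"top_arc", "mid_bar", "bottom_arc", "upper_left_stroke"},
}


def feature_evidence(unknown, proto):
    """Fraction of the unknown's segments that the prototype explains."""
    return len(unknown & proto) / len(unknown)


def proto_evidence(unknown, proto):
    """Fraction of the prototype's segments that the unknown contains."""
    return len(unknown & proto) / len(proto)


# A clean '3' taken from an image.
unknown_3 = {"top_arc", "mid_bar", "bottom_arc"}

scores = {
    label: (feature_evidence(unknown_3, proto), proto_evidence(unknown_3, proto))
    for label, proto in PROTOTYPES.items()
}

for label, (feat, prot) in scores.items():
    print(label, "feature evidence:", feat, "proto evidence:", prot)
```

In this toy model every segment of the '3' is also present in the '9'
prototype, so only the "how much of the prototype is missing" score
separates the two. The more a matcher forgives missing prototype
parts, the closer the two candidates become.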

For starters, you can try turning the "tess_cn_matching" and
"tess_bn_matching" config vars on and off. I have never tried
"disable_character_fragments", but it seems it could also help; report
back if it does. Experiment with "classify_adapt_proto_thresh" and
"classify_adapt_feature_thresh". There may be more to try: study the
matcher and pruner code ("intmatcher.cpp").
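
A minimal sketch of how such an experiment could be scripted, assuming
a Tesseract 3.x command line of the form "tesseract image outputbase
-l lang configfile" and a config file with one "name value" pair per
line. The variable names come from this message; the values, the file
names and the "eng" language code are placeholders, not
recommendations.

```python
# Placeholder values to experiment with; they are not tuned settings.
EXPERIMENT = {
    "tess_cn_matching": "0",
    "disable_character_fragments": "1",
    "classify_adapt_proto_thresh": "230",
    "classify_adapt_feature_thresh": "230",
}


def render_config(variables):
    """Render a config file body: one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


def build_command(image, output_base, config_path, lang="eng"):
    """Build the argument list for one Tesseract run with a config file."""
    return ["tesseract", str(image), str(output_base), "-l", lang, str(config_path)]


config_text = render_config(EXPERIMENT)
# "sample.tif", "sample_out" and "experiment.cfg" are hypothetical names.
command = build_command("sample.tif", "sample_out", "experiment.cfg")

print(config_text)
print(" ".join(command))
```

Save config_text to the config file and run the command (for example
with subprocess.run(command, check=True)), changing one variable at a
time so that any difference in the output can be attributed to it.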

Good luck. Don't forget about real samples. Please post all
correspondence to the forum.

Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Mon, Oct 15, 2012 at 11:31 PM, Andres <andrej...@gmail.com> wrote:
> Hello fellows,
>
> Sometimes:
>
> ‘6’ is recognized as ‘8’, ‘3’ as ‘9’, and other similar examples
> make me almost sure that Tesseract assumes the image has some gap
> and that it has to make corrections. Is there a way to suppress that
> behavior? (To be clearer: it interprets a ‘3’ as a ‘9’ with a
> missing part.)
>
>  For you to see, this is my tif image for training:
>
> https://docs.google.com/open?id=0BxkuvS_LuBAzR1h0Z3YydjlzVTQ
>
> This is my box file:
>
> https://docs.google.com/open?id=0BxkuvS_LuBAzOTV0cEJaNlNLMzA
>
> Do you agree that this behavior exists? Do you have any ideas on
> how to work around it?
>
> By the way, I have some extra questions:
>
> - have you any suggestions in order to improve my tif file?
>
> - is there any problem in mixing characters of different sizes as I’m
> doing?
>
> - what’s the advantage of using a multi page tif file ?
>
> - I’ve been working with this for a long while, and I never got good results
> distinguishing between ‘Q’, ‘D’ and ‘O’. Could you give me any tips?
>
>
> Best regards,
>
>  Andres
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesseract-ocr@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-ocr+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
