Re: Tesseract Reading Issue

Austin Henderson Mon, 19 Jul 2010 07:43:26 -0700

Thank you for your feedback.

I am working on some automated image pre-processing to try to remove the lines before reading, and I am having better results. I just wanted to make sure I didn’t miss an optional setting that would allow it to differentiate better between these blocks.
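For anyone wondering what this kind of pre-processing might look like, here is a minimal sketch. It is not the exact code in use and not part of Tesseract. It assumes the image has already been binarised into rows of 0 (background) and 1 (ink), and it treats any horizontal run of ink at least `min_run` pixels long as a ruled line. The function name and the threshold are made up for illustration.

```python
def remove_long_horizontal_runs(image, min_run):
    """Return a copy of a binary image with long horizontal ink runs erased.

    image:   list of rows, each a list of 0 (background) or 1 (ink).
    min_run: ink runs at least this many pixels long are treated as
             ruled lines and cleared (an illustrative threshold).
    """
    cleaned = [row[:] for row in image]  # leave the caller's image untouched
    for row in cleaned:
        width = len(row)
        x = 0
        while x < width:
            if row[x] == 1:
                start = x
                # Walk to the end of this run of ink pixels.
                while x < width and row[x] == 1:
                    x += 1
                if x - start >= min_run:
                    for i in range(start, x):
                        row[i] = 0
            else:
                x += 1
    return cleaned


image = [
    [0, 1, 0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],  # a ruled line spanning the full width
    [0, 1, 0, 0, 1, 0, 1, 0],
]
cleaned = remove_long_horizontal_runs(image, min_run=6)
```

One known weakness of this approach: wherever a character stroke crosses the line, the crossing pixels are erased too, so the threshold needs to be well above the widest character stroke.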

This is really the same issue that I posted about earlier, where handwriting above or below the text was grouped in with that text when read, which caused bad reads. It is helpful to have a somewhat better understanding of what is happening under the hood that is causing this problem.

I suppose I don’t understand why the space before/after the word is not "enough" for it to see those as different objects. Do you think tosp_table_xht_sp_ratio could have any impact on this if I tweak it? I am not really sure I understand the significance of the values passed for this option, though.
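To make the question concrete, here is a rough sketch of what "a wide enough gap starts a new block" could mean. This is not how Tesseract actually segments text, and it has nothing to do with what tosp_table_xht_sp_ratio does internally. The function names, the (left, right, height) box format and the `gap_ratio` threshold are all assumptions for illustration.

```python
from statistics import median


def split_into_blocks(boxes, gap_ratio=1.0):
    """Group character boxes on one text line into blocks.

    boxes:     list of (left, right, height) tuples in pixels.
    gap_ratio: a new block starts when the horizontal gap to the previous
               box exceeds gap_ratio times the taller of the two boxes
               (an illustrative rule, not Tesseract's).
    """
    blocks = []
    for box in sorted(boxes):  # left-to-right order
        left, right, height = box
        if blocks:
            prev_left, prev_right, prev_height = blocks[-1][-1]
            gap = left - prev_right
            if gap <= gap_ratio * max(prev_height, height):
                blocks[-1].append(box)
                continue
        blocks.append([box])
    return blocks


def block_heights(blocks):
    """Estimate a typical character height separately for each block."""
    return [median(height for _, _, height in block) for block in blocks]


# A tall bar on the left, then three small digits well to the right of it.
boxes = [(0, 4, 40), (60, 70, 12), (72, 82, 12), (84, 94, 12)]
blocks = split_into_blocks(boxes)
heights = block_heights(blocks)
```

With these numbers the bar ends up in a block of its own, so the height estimate for the digits comes only from the digits. That is the behaviour I would have expected from the spacing alone.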


Thanks
Austin

-----Original Message-----
From: patrickq

Sent: Monday, July 19, 2010 9:00 AM
To: tesseract-ocr
Subject: Re: Tesseract Reading Issue

Setting the segmentation mode to PSM_SINGLE_LINE doesn't help (I
checked).

Here is an even more striking example: "John Doe" and
"j...@widgets.com": http://www.scanbizcards.com/johndoe.jpg
Just because the email address uses a smaller font, Tesseract 3.0
stubbornly insists on interpreting all the letters of "John Doe" as
tall lowercase or uppercase letters/digits, yielding something like
"JO11fl DO9".
What's even more bizarre here is that Tesseract should "see" that the
'n' in "John" is much smaller than the 'J' and 'h' so even within that
word the assumption that the 'n' is a tall letter makes no sense!

Tesseract is a great piece of software yet basic issues like that make
us (Tesseract) look like a retarded person BEFORE his morning
coffee :-). Yes, Tesseract was meant for uniform pages of text but the
reality is that lots and lots and lots of people use it for non-
uniform texts.

On Jul 19, 8:30 am, "Jimmy O'Regan" <jore...@gmail.com> wrote:

On 19 July 2010 13:20, patrickq <patrick.questemb...@gmail.com> wrote:



> This is a great example of a serious problem with Tesseract when
> analyzing any image with fonts of variable sizes such as a street
> sign, flyer, business card etc. What happens is that Tesseract's
> adaptive classifier makes assumptions about letter heights and uses
> that knowledge when recognizing the next characters. This is right and
> useful when parsing a word or (to a lesser degree but still) a
> sentence with words separated by spaces because in that case it makes
> sense to assume uniformity. However it is dead wrong when dealing with
> different blocks. In your case, the tall bar is separated by enough
> space that it should be treated as a different block and that letter
> should NOT cause Tesseract to assume ANYTHING about letter height when
> it tackles the next block with the phone number.

> The good news is that the fix required in Tesseract is really not that
> hard, it's essentially about resetting the adaptive classifier between
> blocks (separated by space larger than a blank vertically or like your
> example, horizontally). Even better news: Jimmy is working on it ...

Well, it won't do him any good because he's using tessnet2, so he
won't get the fix if/when I find it.

Actually, my current thought is that setting segmentation to line mode
might be enough to solve this problem, but I haven't gotten around to
checking. I'm a little too wrapped up in internationalising Tesseract
(which is an issue a little closer to my own interests).

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

--

You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.

To post to this group, send email to tesseract-...@googlegroups.com.

To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

