I suspected something like this.

FYI, a technical detail that is very relevant for your case: when somebody
feeds tesseract a white-text-on-dark-background image, tesseract OFTEN
SEEMS TO WORK. Until you think it's doing fine and you get a very
hard-to-notice lower total quality of OCR output than with comparable black
text on white background. Here's what's going on under the hood and why I
emphatically advise everybody to NEVER feed tesseract white-on-black:

Tesseract picks up your image and looks at its metadata: width, height
and RGB/number of colors. Fine so far.
Now it goes and looks at the image pixels and runs a so-called segmentation
process. Fundamentally, it runs its own thresholding filter over your
pixels to produce a pure 0/1 black & white copy of the picture: this one is
simpler and faster to search as tesseract applies algorithms to discover
the position and size of each bit of text: the bounding-boxes list. Every
box (a horizontal rectangle) surrounds [one] [word] [each]. Like I did with
the square brackets [•••] just now. (For C++ code readers: yes, I'm skipping
stuff and not being *exact* about what happens. RTFC if you want the
absolute and definitive truth.)
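
To make the thresholding idea concrete, here's a tiny toy sketch in Python
(the fixed threshold value and the 0=dark/1=light convention are my own
picks for illustration; tesseract's real binarization is adaptive and far
smarter):

```python
def binarize(rows, threshold=128):
    """Map greyscale pixels (0..255) to a pure 0/1 bitmap:
    0 = dark (candidate ink), 1 = light (candidate background)."""
    return [[0 if p < threshold else 1 for p in row] for row in rows]

# A dark 'stroke' in the middle of a light background row:
print(binarize([[250, 240, 30, 20, 25, 245]]))
# -> [[1, 1, 0, 0, 0, 1]]
```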

Now each of these b-boxes (bounding boxes) is clipped (extracted) from
your source image and fed, one vertical pixel line after another, into the
LSTM OCR engine, which spits out a synchronous stream of probabilities:
think "30% chance that was an 'a' just now, 83% chance it was a 'd' and 57%
chance I was looking at a 'b' instead. Meanwhile here's all the rest of the
alphabet, but their chances are very low indeed."
So the next bit of tesseract logic looks at this and picks the highest
probable occurrence: 'd'. (Again, way more complex than this, but this is
the base of it all and very relevant for our "don't ever do white-on-black
while it might seem to work just fine right now!")
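
In code, that "pick the winner" step is nothing more than an argmax over
the rankings (toy numbers taken from the example above, not real engine
output):

```python
# Toy per-step character rankings, as in the example above:
rankings = {'a': 0.30, 'd': 0.83, 'b': 0.57}

# Pick the character with the highest ranking:
winner = max(rankings, key=rankings.get)
print(winner)  # -> d
```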

By the time tesseract has 'decoded' the perceived word in that little b-box
image, it may have 'read' the word 'dank', for example. The 'd' was just
the first character in there.
Tesseract ALSO has collected the top rankings (you may have noticed that my
'probabilities' did not add up to 100%, so we call them rankings instead of
probabilities).
It also calculated a ranking for the word as a whole, say 78% (and rankings
are not real percentages so I'm lying through my teeth here. RTFC if you
need that for comfort. Meanwhile I stick to the storyline here...)

Now there's a tiny single line of code in tesseract which gets to look
at that number. It is one of the many "heuristics" in there. And it says:
"if this word ranking is below 0.7 (70%), we need to TRY AGAIN: Invert(!!!)
that word box image and run it through the engine once more! When you're
done, compare the ranking of the word you got this second time around and
may the best one win!"
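
In pseudo-Python the heuristic boils down to something like this (the
function names and the toy engine below are made up for illustration; the
real logic lives in the C++ sources):

```python
BENCHMARK = 0.7  # the word-ranking cut-off from the heuristic

def recognize(box, ocr, invert):
    """Run the engine on a word box; if the ranking falls below the
    benchmark, try the inverted clip too and let the best ranking win.
    `ocr` and `invert` are stand-ins for the real engine calls."""
    word, rank = ocr(box)
    if rank < BENCHMARK:
        word2, rank2 = ocr(invert(box))
        if rank2 > rank:
            word, rank = word2, rank2
    return word, rank

# Toy engine: scores the inverted clip much better, as in scenario (2) below.
fake_ocr = lambda img: ('dank', 0.78) if img == 'black-on-white' else ('g1bb', 0.45)
fake_invert = lambda img: 'black-on-white'
print(recognize('white-on-black', fake_ocr, fake_invert))  # -> ('dank', 0.78)
```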
For a human, the heuristic seems obvious and flawless. In actual practice
however, the engine can be a little crazy sometimes when it's fed horribly
unexpected pixel input, and there's a small but noticeable number of times
where the gibberish wins because the engine got stoned as a squirrel and
announced the inverted pixels have a 71% ranking for 'Q0618'. Highest
bidder wins and you get gibberish (at best) or a totally incorrect word
like 'quirk' at worst: both are very wrong, but your chances of discovering
the second kind of fault are nigh zero, particularly when you have
automated this process and are processing images in bulk.

Two ways (3, rather!) this has a detrimental effect on your output
quality:

1: if you start with white-on-black, tesseract's 'segmentation' has to deal
with white-on-black too, and my findings are: the b-box discovery delivers
worse results. That's bad in two ways, as both (2) and (3) then don't
receive optimal input image clippings.
2: by now you will have guessed it: you started with white-on-black
(white-on-green in your specific case), so the first round through tesseract
feeds it a bunch of highly unexpected 'crap' it was never taught to deal
with: gibberish is the result and lots of 'words' arrive at that heuristic
with rankings way below that 0.7 benchmark, so the second run saves your
ass by rerunning the INVERTED image and very probably observing serious
winners that time, so everything LOOKS good for the test image.

Meanwhile, we know that the tesseract engine, like any neural net, can go
nuts and output gibberish at surprisingly high confidence rankings: assuming
your first run delivered gibberish with such a high confidence, barely or
quite a lot higher than the 0.7 benchmark, you WILL NOT GET THAT SECOND RUN
and thus crazy stuff will be your end result. Ouch.

3: same as (2) but now twisted in the other direction: tesseract has a bout
of self-doubt somehow (computer pixel fonts like yours are a candidate for
this) and thus produces the intended word 'dank' during the second run, but
at a surprisingly LOW ranking of, say, 65%, while the first-round gibberish
had the rather idiotic ranking of 67%: still below the 0.7 benchmark, but
"winner takes all" now has to obey and lets the gibberish pass anyhow:
'dank' scored just a wee bit lower!
Again, a fat failure in terms of total quality of output, but it happens.
Rarely, but often enough to screw you up.

Of course you can argue the same from the by-design black-on-white input,
so what's the real catch here?! When you ensure, BEFOREHAND, that tesseract
receives black-on-white, high-contrast input images, (1) will do a better
job, hence reducing your total error rate. (2) is a non-scenario now
because your first round gets black-on-white, as everybody trained for, so
no crazy confusion this way. Thus another, notable, improvement in total
error rate / quality.
(3) still happens, but in the reverse order: the first round produces the
intended 'dank' word at low confidence, so second round is run and
gibberish wins, OUCH!, **but** the actual probability of this happening
just dropped a lot as your 'not passing the benchmark' test is now
dependent on the 'lacking confidence' scenario part, which is (obviously?)
*rarer* than the *totally-confused-but-rather-confident* first part of the
original scenario (3).

Thus all 3 failure modes have a significantly lower probability of actually
occurring when you feed tesseract black-on-white text, as it was designed
to eat that kind of porridge.

Therefore: high contrast is good. Better yet: flip it around (invert the
image), possibly after having done the to-greyscale conversion yourself as
well. Your images will thank you. (Bonus points! Not having to execute the
second run means spending about half the time in the CPU-intensive neural
net: higher performance and fewer errors all at the same time 🥳🥳)
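
A minimal sketch of that invert step on raw greyscale values (pure Python
for illustration; with Pillow the equivalent would be something like
`ImageOps.invert(img.convert("L"))`):

```python
def to_black_on_white(grey_rows):
    """Invert a greyscale bitmap (0 = black, 255 = white) so that light
    text on a dark background becomes dark text on a light one."""
    return [[255 - p for p in row] for row in grey_rows]

# Bright-ish text pixels (230) on a dark background (20):
print(to_black_on_white([[20, 230, 230, 20]]))
# -> [[235, 25, 25, 235]]
```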



Why does tesseract have that 0.7 heuristic then? That's a story for another
time, but it has its uses...

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   [email protected]
mobile: +31-6-11 120 978
--------------------------------------------------

On Sat, 1 Nov 2025, 06:01 Michael Schuh, <[email protected]> wrote:

> Rucha > Green? Why?
>
> Ger > Indeed, why? (What is the thought that drove you to run this
> particular imagemagick command?)
>
> Fair questions.  I saw both black and white in the text so I picked a
> background color that does not exist in the text and has high contrast.
>  tesseract did a great job with the green background.  I want to process
> images to extract Palo Alto California tide data, date, and time and then
> plot the results against xtide predictions.  I am close to processing a
> day's worth of images collected once a minute so I will see how well the
> green background works.  If I have problems, I will definitely try using
> your (Ger and Rucha's) advice.
>
> Thank you Ger and Rucha very much for your advice.
>
> Best Regards,
>    Michael
>
> On Fri, Oct 31, 2025 at 5:52 PM Ger Hobbelt <[email protected]> wrote:
>
>> Indeed, why? (What is the thought that drove you to run this particular
>> imagemagick command?)  While it might help visually debugging something
>> you're trying, the simplest path towards "black text on white background"
>> is
>>
>> 1. converting any image to greyscale. (and see for yourself if that
>> output is easily legible; if it's not, chances are the machine will have
>> trouble too, so more preprocessing /before/ the greyscale transform is
>> needed then)
>> 2. use a 'threshold' (a.k.a. binarization) step to possibly help (though
>> tesseract can oftentimes do a better job with greyscale instead of hard
>> black & white as there's more 'detail' in the image pixels then. YMMV).
>>
>> You can do this many ways, using imagemagick is one, openCV another. For
>> one-offs I use Krita / Photoshop filter layers (stacking the filters to get
>> what I want).
>> Anything really that gets you something that approaches 'crisp dark/black
>> text on a clean, white background, text characters about 30px high' (dpi is
>> irrelevant, though often mentioned elsewhere: tesseract does digital image
>> pixels, not classical printer mindset dots-per-inch).
>>
>> Note that 'simplest path towards' does not mean 'always the best way'.
>>
>> Met vriendelijke groeten / Best regards,
>>
>> Ger Hobbelt
>>
>> --------------------------------------------------
>> web:    http://www.hobbelt.com/
>>         http://www.hebbut.net/
>> mail:   [email protected]
>> mobile: +31-6-11 120 978
>> --------------------------------------------------
>>
>>
>> On Fri, Oct 31, 2025 at 5:46 AM Rucha Patil <[email protected]>
>> wrote:
>>
>>> Green? Why? I don't know if this might resolve the issue. Lmk the
>>> behavior, I'm curious. But you need an image that has a white background
>>> and black text. You can achieve this easily using cv2 functions.
>>>
>>> On Thu, Oct 30, 2025 at 1:26 PM Michael Schuh <[email protected]> wrote:
>>>
>>>> I am trying to extract the date and time from
>>>>
>>>> [image: time.png]
>>>>
>>>> I have successfully used tesseract to extract text from other images.
>>>> tesseract does not find any text in the above image:
>>>>
>>>>    michael@argon:~/michael/trunk/src/tides$ tesseract time.png out
>>>>    Estimating resolution as 142
>>>>
>>>>    michael@argon:~/michael/trunk/src/tides$ cat out.txt
>>>>
>>>>    michael@argon:~/michael/trunk/src/tides$ ls -l out.txt
>>>>    -rw-r----- 1 michael michael 0 Oct 30 08:58 out.txt
>>>>
>>>> Any help you can give me would be appreciated.  I attached the time.png
>>>> file I used above.
>>>>
>>>> Thanks,
>>>>    Michael
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/77ac0d2b-7796-4f17-8bc6-0e70a9653adan%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/77ac0d2b-7796-4f17-8bc6-0e70a9653adan%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

