Ger,

Thank you very, very much for your detailed explanation of how tesseract
processes my image!

I now see the wisdom and CPU-time benefits of creating a black text on
white background image.  I will work on doing this.

Best Regards,
   Michael

On Sat, Nov 1, 2025 at 8:50 AM Ger Hobbelt <[email protected]> wrote:

> (apologies for the typos and uncorrected mobile phone autocorrect eff-ups
> in that text just now)
>
> Met vriendelijke groeten / Best regards,
>
> Ger Hobbelt
>
> --------------------------------------------------
> web:    http://www.hobbelt.com/
>         http://www.hebbut.net/
> mail:   [email protected]
> mobile: +31-6-11 120 978
> --------------------------------------------------
>
> On Sat, 1 Nov 2025, 16:48 Ger Hobbelt, <[email protected]> wrote:
>
>> I suspected something like this.
>>
>> FYI, a technical detail that is very relevant for your case: when somebody
>> feeds tesseract a white-text-on-dark-background image, tesseract OFTEN
>> SEEMS TO WORK. Until you think it's doing fine and you get a very hard to
>> notice lower total quality of OCR output than with comparable black text on
>> a white background. Here's what's going on under the hood and why I
>> emphatically advise everybody to NEVER feed tesseract white on black:
>>
>> Tesseract code picks up your image and looks at its metadata: width,
>> height and RGB/number of colors. Fine so far.
>> Now it goes and looks at the image pixels and runs a so-called
>> segmentation process. Fundamentally, it runs its own thresholding filter
>> over your pixels to produce a pure 0/1 black & white picture copy: this one
>> is simpler and faster to search as tesseract applies algorithms to discover
>> the position and size of each bit of text: the bounding-boxes list. Every
>> box (a horizontal rectangle) surrounds [one] [word] [each]. Like I did with
>> the square brackets [•••] just now. (For C++ code readers: yes, I'm skipping
>> stuff and not being *exact* in what happens. RTFC if you want the absolute
>> and definitive truth.)
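>>
>> If you want to get a feel for what that 0/1 copy and its boxes look like,
>> here's a little OpenCV 4 sketch of the same idea, assuming a dark-text-on-
>> light-background input ('input.png' is a placeholder name). It is NOT
>> tesseract's actual code, and it boxes connected blobs of ink rather than
>> whole words, but the principle is the same:
>>
>>     import cv2
>>
>>     # threshold to a pure 0/1 copy; THRESH_BINARY_INV marks the dark 'ink'
>>     # as white so findContours can pick it up
>>     img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
>>     _, ink = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
>>
>>     # box every connected blob of ink (roughly: characters / word fragments)
>>     contours, _ = cv2.findContours(ink, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>>     for c in contours:
>>         x, y, w, h = cv2.boundingRect(c)
>>         print("box:", x, y, w, h)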
>>
>> Now each of these b-boxes (bounding boxes) is clipped (extracted) from
>> your source image and fed, one vertical pixel line after another, into the
>> LSTM OCR engine, which spits out a synchronous stream of probabilities:
>> think "30% chance that was an 'a' just now, 83% chance it was a 'd' and 57%
>> chance I was looking at a 'b' instead. Meanwhile here's all the rest of the
>> alphabet but their chances are very low indeed."
>> So the next bit of tesseract logic looks at this and picks the highest
>> probable occurrence: 'd'. (Again, way more complex than this, but this is
>> the base of it all and very relevant for our "don't ever do white-on-black
>> even though it might seem to work just fine right now!")
>>
>> By the time tesseract has 'decoded' the perceived word in that little
>> b-box image, it may have 'read' the word 'dank', for example. The 'd' was
>> just the first character in there.
>> Tesseract ALSO has collected the top rankings (you may have noticed that
>> my 'probabilities' did not add up to 100%, so we call them rankings instead
>> of probabilities).
>> It has also calculated a ranking for the word as a whole, say 78% (and
>> rankings are not real percentages so I'm lying through my teeth here. RTFC
>> if you need that for comfort. Meanwhile I stick to the storyline here...)
>>
>> Now there's a tiny single line of code in tesseract which gets to
>> look at that number. It is one of the many "heuristics" in there. And it
>> says: "if this word ranking is below 0.7 (70%), we need to TRY AGAIN:
>> Invert(!!!) that word box image and run it through the engine once more!
>> When you're done, compare the ranking of the word you got this second time
>> around and may the best one win!"
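>>
>> In pseudo-Python the gist of that heuristic is roughly this. Purely
>> illustrative: the real thing lives in the C++ sources, and run_lstm /
>> invert here are stand-ins for an engine pass over one word box and a
>> pixel inversion:
>>
>>     THRESHOLD = 0.7
>>
>>     def recognize_word(word_box, run_lstm, invert):
>>         text, ranking = run_lstm(word_box)               # first pass, image as-is
>>         if ranking < THRESHOLD:                          # not confident enough? TRY AGAIN:
>>             text2, ranking2 = run_lstm(invert(word_box)) # ...on the inverted pixels
>>             if ranking2 > ranking:                       # may the best one win
>>                 return text2, ranking2
>>         return text, ranking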
>> For a human, the heuristic seems obvious and flawless. In actual practice
>> however, the engine can be a little crazy sometimes when it's fed horribly
>> unexpected pixel input and there's a small but noticeable number of times
>> where the gibberish wins because the engine got stoned as a squirrel and
>> announced the inverted pixels have a 71% ranking for 'Q0618'. Highest
>> bidder wins and you get gibberish (at best) or a totally incorrect word
>> like 'quirk' at worst: both are very wrong, but your chances of discovering
>> the second kind of fault are nigh impossible, particularly when you have
>> automated this process as you process images in bulk.
>>
>> Two ways (3, rather!) this has a detrimental effect on your output
>> quality:
>>
>> 1: if you start with white-on-black, tesseract 'segmentation' has to deal
>> with white-on-black too and my finding is: the b-boxes discovery delivers
>> worse results. That's bad twice over, because both (2) and (3) then don't
>> receive optimal input image clippings.
>> 2: by now you will have guessed it: you started with white-on-black
>> (white-on-green in your specific case) so the first round through tesseract
>> is feeding it a bunch of highly unexpected 'crap' it was never taught to
>> deal with: gibberish is the result and lots of 'words' arrive at that
>> heuristic with rankings way below that 0.7 benchmark, so the second run
>> saves your ass by rerunning the INVERTED image and very probably observing
>> serious winners that time, so everything LOOKS good for the test image.
>>
>> Meanwhile, we know that the tesseract engine, like any neural net, can go
>> nuts and output gibberish at surprisingly high confidence rankings: assuming
>> your first run delivered gibberish with such a high confidence, barely or
>> quite a lot higher than the 0.7 benchmark, you WILL NOT GET THAT SECOND RUN
>> and thus crazy stuff will be your end result. Ouch.
>>
>> 3: same as (2) but now twisted in the other direction: tesseract has a
>> bout of self-doubt somehow (computer pixel fonts like yours are a candidate
>> for this) and thus produces the intended word 'dank' during the second run
>> but at a surprisingly LOW ranking of, say, 65%, while the first-round gibberish
>> had the rather idiotic ranking of 67%, still below the 0.7 benchmark but
>> "winner takes all" now has to obey and let the gibberish pass anyhow:
>> 'dank' scored just a wee bit lower!
>> Again, a big fat failure in terms of total quality of output, but it happens.
>> Rarely, but often enough to screw you up.
>>
>> Of course you can argue the same from the by-design black-on-white input,
>> so what's the real catch here?! When you ensure, BEFOREHAND, that tesseract
>> receives black-on-white, high-contrast input images, (1) will do a better
>> job, hence reducing your total error rate. (2) is a non-scenario now
>> because your first round gets black-on-white, as everybody trained for, so
>> no crazy confusion this way. Thus another, notable, improvement in total
>> error rate / quality.
>> (3) still happens, but in the reverse order: the first round produces the
>> intended 'dank' word at low confidence, so the second round is run and
>> gibberish wins, OUCH!, **but** the actual probability of this happening
>> just dropped a lot as your 'not passing the benchmark' test is now
>> dependent on the 'lacking confidence' scenario part, which is (obviously?)
>> *rarer* than the *totally-confused-but-rather-confident* first part of the
>> original scenario (3).
>>
>> Thus all 3 failure modes have a significantly lower probability of
>> actually occurring when you feed tesseract black-on-white text, as it was
>> designed to eat that kind of porridge.
>>
>> Therefore: high contrast is good. Better yet: flip it around (invert the
>> image), possibly after having done the to-greyscale conversion yourself as
>> well. Your images will thank you (bonus points! Not having to execute the
>> second run means spending about half the time in the CPU-intensive neural
>> net: higher performance and fewer errors all at the same time 🥳🥳)
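>>
>> A minimal sketch of that pre-flip with OpenCV, assuming your source is
>> light text on a dark background and that you then feed the result to the
>> tesseract command line exactly as before (the output file name is just an
>> example):
>>
>>     import cv2
>>
>>     # greyscale first, then flip so the text ends up dark on a light background
>>     img = cv2.imread("time.png", cv2.IMREAD_GRAYSCALE)
>>     inverted = cv2.bitwise_not(img)
>>     cv2.imwrite("time_inverted.png", inverted)
>>
>>     # then run:  tesseract time_inverted.png out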
>>
>>
>>
>> Why does tesseract have that 0.7 heuristic then? That's a story for
>> another time, but it has its uses...
>>
>> Met vriendelijke groeten / Best regards,
>>
>> Ger Hobbelt
>>
>> --------------------------------------------------
>> web:    http://www.hobbelt.com/
>>         http://www.hebbut.net/
>> mail:   [email protected]
>> mobile: +31-6-11 120 978
>> --------------------------------------------------
>>
>> On Sat, 1 Nov 2025, 06:01 Michael Schuh, <[email protected]> wrote:
>>
>>> Rucha > Green? Why?
>>>
>>> Ger > Indeed, why? (What is the thought that drove you to run this
>>> particular imagemagick command?)
>>>
>>> Fair questions.  I saw both black and white in the text so I picked a
>>> background color that does not exist in the text and has high contrast.
>>>  tesseract did a great job with the green background.  I want to process
>>> images to extract Palo Alto California tide data, date, and time and then
>>> plot the results against xtide predictions.  I am close to processing a
>>> day's worth of images collected once a minute so I will see how well the
>>> green background works.  If I have problems, I will definitely try using
>>> your (Ger and Rucha's) advice.
>>>
>>> Thank you Ger and Rucha very much for your advice.
>>>
>>> Best Regards,
>>>    Michael
>>>
>>> On Fri, Oct 31, 2025 at 5:52 PM Ger Hobbelt <[email protected]>
>>> wrote:
>>>
>>>> Indeed, why? (What is the thought that drove you to run this particular
>>>> imagemagick command?)  While it might help with visually debugging something
>>>> you're trying, the simplest path towards "black text on white background"
>>>> is
>>>>
>>>> 1. convert any image to greyscale (and see for yourself if that
>>>> output is easily legible; if it's not, chances are the machine will have
>>>> trouble too, so more preprocessing /before/ the greyscale transform is
>>>> needed then)
>>>> 2. use a 'threshold' (a.k.a. binarization) step to possibly help
>>>> (though tesseract can oftentimes do a better job with greyscale instead of
>>>> hard black & white as there's more 'detail' in the image pixels then. 
>>>> YMMV).
>>>>
>>>> You can do this many ways, using imagemagick is one, openCV another.
>>>> For one-offs I use Krita / Photoshop filter layers (stacking the filters to
>>>> get what I want).
>>>> Anything really that gets you something that approaches 'crisp
>>>> dark/black text on a clean, white background, text characters about 30px
>>>> high' (dpi is irrelevant, though often mentioned elsewhere: tesseract does
>>>> digital image pixels, not classical printer mindset dots-per-inch).
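>>>>
>>>> For example, with OpenCV (only one of many ways; 'input.png' is a
>>>> placeholder, the 2x scale factor is a guess you would tune until the
>>>> characters come out around 30px high, and the threshold step is
>>>> optional since greyscale alone is often good enough):
>>>>
>>>>     import cv2
>>>>
>>>>     img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
>>>>
>>>>     # upscale so the characters end up roughly 30px high
>>>>     img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
>>>>
>>>>     # optional binarization (Otsu); check the result is still legible
>>>>     _, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
>>>>     cv2.imwrite("input_clean.png", bw)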
>>>>
>>>> Note that 'simplest path towards' does not mean 'always the best way'.
>>>>
>>>> Met vriendelijke groeten / Best regards,
>>>>
>>>> Ger Hobbelt
>>>>
>>>> --------------------------------------------------
>>>> web:    http://www.hobbelt.com/
>>>>         http://www.hebbut.net/
>>>> mail:   [email protected]
>>>> mobile: +31-6-11 120 978
>>>> --------------------------------------------------
>>>>
>>>>
>>>> On Fri, Oct 31, 2025 at 5:46 AM Rucha Patil <
>>>> [email protected]> wrote:
>>>>
>>>>> Green? Why? I don't know if this might resolve the issue. Let me know the
>>>>> behavior; I'm curious. But you need an image that has a white background and
>>>>> black text. You can achieve this easily using cv2 functions.
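>>>>>
>>>>> For instance, a rough end-to-end sketch with cv2 plus the pytesseract
>>>>> wrapper (assuming both are installed; invert only if your text is
>>>>> lighter than the background, as in your attached image):
>>>>>
>>>>>     import cv2
>>>>>     import pytesseract
>>>>>
>>>>>     img = cv2.imread("time.png", cv2.IMREAD_GRAYSCALE)
>>>>>     img = cv2.bitwise_not(img)   # flip to black text on a white background
>>>>>     print(pytesseract.image_to_string(img))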
>>>>>
>>>>> On Thu, Oct 30, 2025 at 1:26 PM Michael Schuh <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I am trying to extract the date and time from
>>>>>>
>>>>>> [image: time.png]
>>>>>>
>>>>>> I have successfully used tesseract to extract text from other images.
>>>>>> tesseract does not find any text in the above image:
>>>>>>
>>>>>>    michael@argon:~/michael/trunk/src/tides$ tesseract time.png out
>>>>>>    Estimating resolution as 142
>>>>>>
>>>>>>    michael@argon:~/michael/trunk/src/tides$ cat out.txt
>>>>>>
>>>>>>    michael@argon:~/michael/trunk/src/tides$ ls -l out.txt
>>>>>>    -rw-r----- 1 michael michael 0 Oct 30 08:58 out.txt
>>>>>>
>>>>>> Any help you can give me would be appreciated.  I attached the
>>>>>> time.png file I used above.
>>>>>>
>>>>>> Thanks,
>>>>>>    Michael
>>>>>>