Ger, Thank you very, very much for your detailed explanation of how tesseract processes my image!
I now see the wisdom and CPU time benefits of creating a black text on white background image. I will work on doing this.

Best Regards,
Michael

On Sat, Nov 1, 2025 at 8:50 AM Ger Hobbelt <[email protected]> wrote:

> (apologies for the typos and uncorrected mobile phone autocorrect eff-ups in that text just now)
>
> Met vriendelijke groeten / Best regards,
>
> Ger Hobbelt
>
> --------------------------------------------------
> web: http://www.hobbelt.com/
>      http://www.hebbut.net/
> mail: [email protected]
> mobile: +31-6-11 120 978
> --------------------------------------------------
>
> On Sat, 1 Nov 2025, 16:48 Ger Hobbelt, <[email protected]> wrote:
>
>> I suspected something like this.
>>
>> FYI, a technical detail that is very relevant for your case: when somebody feeds tesseract a white-text-on-dark-background image, tesseract OFTEN SEEMS TO WORK. You think it's doing fine, yet you get a very hard to notice lower total quality of OCR output than with a comparable black-text-on-white-background image. Here's what's going on under the hood and why I emphatically advise everybody to NEVER feed tesseract white on black:
>>
>> Tesseract picks up your image and looks at its metadata: width, height and RGB/number of colors. Fine so far.
>> Now it looks at the image pixels and runs a so-called segmentation process. Fundamentally, it runs its own thresholding filter over your pixels to produce a pure 0/1 black & white copy: this one is simpler and faster to search as tesseract applies algorithms to discover the position and size of each bit of text: the bounding-boxes list. Every box (a horizontal rectangle) surrounds [one] [word] [each], like I did with the square brackets [•••] just now. (For C++ code readers: yes, I'm skipping stuff and not being *exact* about what happens. RTFC if you want the absolute and definitive truth.)
>>
>> Now each of these b-boxes (bounding boxes) is clipped (extracted) from your source image and fed, one vertical pixel line after another, into the LSTM OCR engine, which spits out a synchronous stream of probabilities: think "30% chance that was an 'a' just now, 83% chance it was a 'd' and 57% chance I was looking at a 'b' instead. Meanwhile here's all the rest of the alphabet, but their chances are very low indeed."
>> The next bit of tesseract logic looks at this and picks the highest-ranked occurrence: 'd'. (Again, way more complex than this, but this is the base of it all and very relevant for our "don't ever do white-on-black, even while it might seem to work just fine right now!")
>>
>> By the time tesseract has 'decoded' the perceived word in that little b-box image, it may have 'read' the word 'dank', for example. The 'd' was just the first character in there.
>> Tesseract ALSO has collected the top rankings (you may have noticed that my 'probabilities' did not add up to 100%, so we call them rankings instead of probabilities).
>> It also calculated a ranking for the word as a whole, say 78% (and rankings are not real percentages, so I'm lying through my teeth here. RTFC if you need that for comfort. Meanwhile I stick to the storyline here...)
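For illustration, here is a minimal sketch (assuming Python with pytesseract and Pillow installed, plus the time.png attachment from later in this thread) that prints the word bounding boxes and per-word confidences tesseract reports externally; these correspond roughly to the b-boxes and word rankings described above, though they are not the engine's internal values:

# Minimal, hedged sketch: inspect word boxes and confidences via pytesseract.
# Assumes pytesseract + Pillow are installed and time.png is in the working dir.
import pytesseract
from PIL import Image

data = pytesseract.image_to_data(Image.open("time.png"),
                                 output_type=pytesseract.Output.DICT)
for text, conf, x, y, w, h in zip(data["text"], data["conf"], data["left"],
                                  data["top"], data["width"], data["height"]):
    if text.strip() and float(conf) >= 0:   # skip empty rows / non-word levels
        print(f"[{text}] conf={float(conf):.0f} box=({x},{y},{w},{h})")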
>> Now there's a tiny single line of code in tesseract which gets to look at that word ranking. It is one of the many "heuristics" in there. And it says: "If this word ranking is below 0.7 (70%), we need to TRY AGAIN: invert(!!!) that word box image and run it through the engine once more! When you're done, compare the ranking of the word you got this second time around and may the best one win!"
>> For a human, the heuristic seems obvious and flawless. In actual practice, however, the engine can be a little crazy sometimes when it's fed horribly unexpected pixel input, and there's a small but noticeable number of times where the gibberish wins because the engine got stoned as a squirrel and announced the inverted pixels have a 71% ranking for 'Q0618'. Highest bidder wins and you get gibberish (at best) or a totally incorrect word like 'quirk' at worst: both are very wrong, but your chance of discovering the second kind of fault is nigh impossible, particularly when you have automated this process as you process images in bulk.
>>
>> Two ways (three, rather!) this has a detrimental effect on your output quality:
>>
>> 1: if you start with white-on-black, tesseract 'segmentation' has to deal with white-on-black too, and my finding is: the b-box discovery delivers worse results. That's bad in two ways, as both (2) and (3) then don't receive optimal input image clippings.
>> 2: by now you will have guessed it: you started with white-on-black (white-on-green in your specific case), so the first round through tesseract feeds it a bunch of highly unexpected 'crap' it was never taught to deal with: gibberish is the result, and lots of 'words' arrive at that heuristic with rankings way below that 0.7 benchmark, so the second run saves your ass by rerunning the INVERTED image and very probably observing serious winners that time, so everything LOOKS good for the test image.
>> Meanwhile, we know that the tesseract engine, like any neural net, can go nuts and output gibberish at surprisingly high confidence rankings: assuming your first run delivered gibberish with such a high confidence, barely or quite a lot higher than the 0.7 benchmark, you WILL NOT GET THAT SECOND RUN and thus crazy stuff will be your end result. Ouch.
>>
>> 3: same as (2) but now twisted in the other direction: tesseract has a bout of self-doubt somehow (computer pixel fonts like yours are a candidate for this) and thus produces the intended word 'dank' during the second run, but at a surprisingly LOW ranking of, say, 65%, while the first-round gibberish had the rather idiotic ranking of 67%: still below the 0.7 benchmark, but "winner takes all" now has to obey and lets the gibberish pass anyhow, because 'dank' scored just a wee bit lower! Again, a fat failure in terms of total quality of output, but it happens. Rarely, but often enough to screw you up.
>>
>> Of course you can argue the same from the by-design black-on-white input, so what's the real catch here?! When you ensure, BEFOREHAND, that tesseract receives black-on-white, high-contrast input images, (1) will do a better job, hence reducing your total error rate. (2) is a non-scenario now because your first round gets black-on-white, as everybody trained for, so no crazy confusion this way. Thus another notable improvement in total error rate / quality.
>> (3) still happens, but in the reverse order: the first round produces the intended 'dank' word at low confidence, so the second round is run and gibberish wins, OUCH!, **but** the actual probability of this happening just dropped a lot, as your 'not passing the benchmark' test now depends on the 'lacking confidence' part of the scenario, which is (obviously?) *rarer* than the *totally-confused-but-rather-confident* first part of the original scenario (3).
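The "winner takes all" logic and failure mode (3) can be pictured with a tiny toy model; this is purely an illustration of the description above (the 0.7 threshold, 'dank', 'Q0618' and the rankings come from the story), not tesseract's actual source code:

# Toy model of the re-run heuristic as described above, NOT tesseract code.
def pick_word(first_pass, second_pass=None, threshold=0.7):
    """Each pass is a (text, ranking) pair; the second pass only happens
    when the first ranking fell below the threshold."""
    text, rank = first_pass
    if rank >= threshold or second_pass is None:
        return text, rank
    text2, rank2 = second_pass
    return (text2, rank2) if rank2 > rank else (text, rank)

# Failure mode (3): first-pass gibberish at 0.67 beats the intended 'dank'
# at 0.65 from the inverted re-run, even though both are below 0.7.
print(pick_word(("Q0618", 0.67), ("dank", 0.65)))   # -> ('Q0618', 0.67)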
>> Thus all 3 failure modes have a significantly lower probability of actually occurring when you feed tesseract black-on-white text, as it was designed to eat that kind of porridge.
>>
>> Therefore: high contrast is good. Better yet: flip it around (invert the image), possibly after having done the to-greyscale conversion yourself as well. Your images will thank you. (Bonus points! Not having to execute the second run means spending about half the time in the CPU-intensive neural net: higher performance and fewer errors all at the same time 🥳🥳)
>>
>> Why does tesseract have that 0.7 heuristic then? That's a story for another time, but it has its uses...
>>
>> Met vriendelijke groeten / Best regards,
>>
>> Ger Hobbelt
>>
>> --------------------------------------------------
>> web: http://www.hobbelt.com/
>>      http://www.hebbut.net/
>> mail: [email protected]
>> mobile: +31-6-11 120 978
>> --------------------------------------------------
>>
>> On Sat, 1 Nov 2025, 06:01 Michael Schuh, <[email protected]> wrote:
>>
>>> Rucha > Green? Why?
>>>
>>> Ger > Indeed, why? (What is the thought that drove you to run this particular imagemagick command?)
>>>
>>> Fair questions. I saw both black and white in the text, so I picked a background color that does not exist in the text and has high contrast. tesseract did a great job with the green background. I want to process images to extract Palo Alto, California tide data, date, and time and then plot the results against xtide predictions. I am close to processing a day's worth of images collected once a minute, so I will see how well the green background works. If I have problems, I will definitely try your (Ger's and Rucha's) advice.
>>>
>>> Thank you, Ger and Rucha, very much for your advice.
>>>
>>> Best Regards,
>>> Michael
>>>
>>> On Fri, Oct 31, 2025 at 5:52 PM Ger Hobbelt <[email protected]> wrote:
>>>
>>>> Indeed, why? (What is the thought that drove you to run this particular imagemagick command?) While it might help with visually debugging something you're trying, the simplest path towards "black text on white background" is:
>>>>
>>>> 1. converting any image to greyscale (and see for yourself if that output is easily legible; if it's not, chances are the machine will have trouble too, so more preprocessing /before/ the greyscale transform is needed then);
>>>> 2. using a 'threshold' (a.k.a. binarization) step, which may possibly help (though tesseract can oftentimes do a better job with greyscale instead of hard black & white, as there's more 'detail' in the image pixels then. YMMV).
>>>>
>>>> You can do this many ways; using imagemagick is one, OpenCV another. For one-offs I use Krita / Photoshop filter layers (stacking the filters to get what I want).
>>>> Anything, really, that gets you something that approaches 'crisp dark/black text on a clean, white background, text characters about 30px high' (dpi is irrelevant, though often mentioned elsewhere: tesseract deals in digital image pixels, not the classical printer mindset of dots-per-inch).
>>>>
>>>> Note that 'simplest path towards' does not mean 'always the best way'.
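For reference, one possible way to do the greyscale / invert / threshold steps described above is with OpenCV (cv2, which Rucha also suggests elsewhere in this thread); the file names and the crude brightness check are illustrative assumptions, not something from the thread:

# Illustrative sketch only; assumes OpenCV is installed and time.png exists.
import cv2

img = cv2.imread("time.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # step 1: convert to greyscale

# Crude check: if the picture is mostly dark, the text is presumably light,
# so invert to get dark text on a light background.
if gray.mean() < 128:
    gray = 255 - gray

# Step 2 (optional): Otsu binarization; plain greyscale sometimes works better.
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("time_bw.png", bw)                       # then: tesseract time_bw.png out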
>>>> Met vriendelijke groeten / Best regards,
>>>>
>>>> Ger Hobbelt
>>>>
>>>> --------------------------------------------------
>>>> web: http://www.hobbelt.com/
>>>>      http://www.hebbut.net/
>>>> mail: [email protected]
>>>> mobile: +31-6-11 120 978
>>>> --------------------------------------------------
>>>>
>>>> On Fri, Oct 31, 2025 at 5:46 AM Rucha Patil <[email protected]> wrote:
>>>>
>>>>> Green? Why? I don't know if this might resolve the issue. Lmk the behavior, I'm curious. But you need an image that has a white background and black text. You can achieve this easily using cv2 functions.
>>>>>
>>>>> On Thu, Oct 30, 2025 at 1:26 PM Michael Schuh <[email protected]> wrote:
>>>>>
>>>>>> I am trying to extract the date and time from
>>>>>>
>>>>>> [image: time.png]
>>>>>>
>>>>>> I have successfully used tesseract to extract text from other images. tesseract does not find any text in the above image:
>>>>>>
>>>>>> michael@argon:~/michael/trunk/src/tides$ tesseract time.png out
>>>>>> Estimating resolution as 142
>>>>>>
>>>>>> michael@argon:~/michael/trunk/src/tides$ cat out.txt
>>>>>>
>>>>>> michael@argon:~/michael/trunk/src/tides$ ls -l out.txt
>>>>>> -rw-r----- 1 michael michael 0 Oct 30 08:58 out.txt
>>>>>>
>>>>>> Any help you can give me would be appreciated. I attached the time.png file I used above.
>>>>>>
>>>>>> Thanks,
>>>>>> Michael
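Once a clip like this has been converted to dark text on a light background (see the preprocessing sketch earlier in this thread), a possible follow-up run looks like the sketch below; the file name time_bw.png and the page-segmentation choice are illustrative assumptions, not something specified in the thread:

# Re-run OCR on the preprocessed clip; --psm 7 treats the image as one text line.
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("time_bw.png"), config="--psm 7")
print(text.strip())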

