- "These scans include characters that are not in the Latin-1 block, which
I read somewhere and now can't find is the limit for the English data."
Well, to put it bluntly, diving into the rabbit hole without a helmet or a
'chute: as far as I have been able to discover, the current "official"
tesseract traineddata files (neural net matrices) that are used to
recognize anything we throw at tesseract were produced ("trained") at
Google by Ray Smith, using copious Google hardware, I expect --
training neural nets is no joy on the average Joe's hardware budget, after
all. When you dig through the git commits, such as
https://github.com/tesseract-ocr/tessdata/commits/main/ , you'll find the
last training file *content* update was back in '17 by @theraysmith, and he
hasn't been around much since:
https://github.com/theraysmith?tab=overview&from=2017-12-01&to=2017-12-31
-- without any hard data, my initial guess is a change of corporate mind at
Google regarding tesseract.
Stefan Weil et al. have done a ton of important work since, but when
you ask "what can this baby recognize?", that translates 1:1 to "what has
tesseract been trained to recognize?", and there... things get a little
vague for me. I'd love to be corrected on this, slapped on the wrist or
worse, but from what I've gleaned so far during my research:
- though there's https://github.com/tesseract-ocr/langdata
, https://github.com/tesseract-ocr/tesstrain
, https://github.com/tesseract-ocr/tessdata_best/commits/main/ and Ray
Smith's public notes and papers about what was done for tesseract v4/v5
at https://github.com/tesseract-ocr/docs (which is separate
from https://github.com/tesseract-ocr/tessdoc, which is user oriented
rather than architectural background), I am not confident that the actual
list of training files used to produce those master traineddata LSTM files
(= the tesseract v4/v5 OCR engine) is checked into git: I have seen a list
of font names used some place in there (or was it the mailing list?), but
for anyone who works with fonts that already is a handwavey kind of thing
and, yes, copyrights, yadayada, will forever prevent anything more precise
from becoming available, because the list most certainly included
commercial fonts.
Then there are also the training input files defining the "text lines" to be
rendered as training material: those actually determine which glyphs in the
fonts will be trained at all (and in what combinations). And there I am not
feeling confident either, as it looks like the files that were published are
the ones from the older v3 engine: still relevant, but *probably* not what
Ray was using to produce the many traineddata files he did at the Google shop.
Having dug through the git histories and inspected the various files,
scripts and notes about 2 years ago, I cannot say with complete confidence
whether your (C), TM and 1/2, 3/4, etc. fraction glyphs made it into the
English training set back then. My *guess* is that they were included, if
only as a few samples, so the neural net will have *some* recollection of
them, but I also expect them to have featured little in the total training
process, so recognition chances are reduced.
(Aside: As we focus on the English language training set here, I didn't
mention the metric ton of work done by @Shreeshrii for Asian scripts,
particularly Devanagari and related, a few years later. As far as I can
tell, most of the `traineddata` scripts and process today are due to
Shreeshrii's and Stefan Weil's work. Stefan, if you look over there, has
done a lot of work around OCR-ing (pre-war) German newspapers and similar
publications, from the era when the Germans had a fondness for printing
everything in (to my eyes) quite hard-to-read blackletter fonts. To make
that feat happen, he and a team from several German universities (if I read
right what was done back then) created a German-specific training set for
newspaper blackletter print and published the resulting tesseract
traineddata OCR databases for public use (the Fraktur models go by language
codes like "frk" and "deu_frak"). I don't recall seeing a publication where
he lists the number of CPU hours used to produce that trained set (one (1)
language and a few fonts, vs. the 400+ fonts allegedly used in the Google
production run), but you can bet your bottom dollar it wasn't cheap! Or
quick!)
When we pop out of the rabbit hole of tesseract history, we might now
better understand why your problem is answered... haphazardly:
- general advice number 1 out there is to 'tune' an existing language
traineddata file if you have special needs, such as your wish to recognize
fractions, etc., which don't feature often in published texts and thus
haven't been a real bother thus far. This "tuning" advice is basically
advice to do a little extra training, which is, to me, a little hairy, as
you are expected to *not* deteriorate the existing recognition ability
while *slightly improving* the recognition confidence (and thus output
quality) for a few glyphs ("characters in your fonts") that are already
mostly recognized by the neural net, in the sense that it already
recognizes part or all of the relevant "shapes" that make up the glyphs you
wish to see recognized. (This is a very rough translation of what a neural
net "learns" vs. how we humans might understand pattern recognition, so
tread carefully around this blather of mine if you think you're getting a
look under the hood. We're rather *paraphrasing* the engine instead of
pointing at its carburetor, spark plugs, etc., if you get my drift.)
Logically, this approach is met with varying success (and crushed hopes),
as it is VERY much dependent on the exact shapes and glyphs (characters)
you add. (TM) might be helped by being quite close to a superscripted T+M,
while the fractions, being a combo of superscript, subscript and a / slash,
might be doable or hard for the LSTM+CTC engine; I cannot tell without
having tried. And training takes time, both in setting it up and in CPU
cycles, so it's not a 5-minute thing to do. Which explains another type of
silence around here.
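To make that concrete: the fine-tuning recipe from the official training
docs looks roughly like the below. Treat it as a sketch from memory, not
gospel: the paths, file names and iteration counts are placeholders of
mine, and you need the training tools built plus your own list of .lstmf
training files first.

    # pull the LSTM model out of the stock best-quality traineddata
    combine_tessdata -e tessdata_best/eng.traineddata eng.lstm

    # fine-tune for a modest number of iterations on your own material
    lstmtraining \
      --model_output output/eng_tuned \
      --continue_from eng.lstm \
      --traineddata tessdata_best/eng.traineddata \
      --train_listfile my_eng.training_files.txt \
      --max_iterations 400

    # bake the checkpoint back into a traineddata tesseract can load
    lstmtraining --stop_training \
      --continue_from output/eng_tuned_checkpoint \
      --traineddata tessdata_best/eng.traineddata \
      --model_output output/eng_tuned.traineddata

Note you must start from the tessdata_best variants: the fast/integerized
models cannot be trained further.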
- if that didn't work, you will read several folks advising to "lop off the
top layer" and retrain the whole language. What this means is that,
basically, you wipe just one of the many layers of the LSTM+CTC neural net,
the one where it is expected to 'conclude' things like "ah... that there
and this shapy thingamajig here, all that jazz is very probably an
'a'...", and you hope that that lopping-off-and-retraining suffices to get
acceptable training results after running the training for a while (and
checking that you're doing all right and not overtraining other bits and
pieces of the engine's alphabet/text output!).
This takes rather more time than "tuning", as you must now retrain at
least an entire layer, while tuning was only intended to have the training
activity tweak a few cell connections in there a little to get what you
wanted.
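Again paraphrasing the official docs from memory: you keep the network up
to a given layer index, append a fresh top layer plus output layer via a
VGSL net_spec, and retrain. The --append_index 5 and '[Lfx256 O1c111]'
below mirror the docs' example; the 111 in O1c111 is the number of output
classes and must match *your* unicharset size, so adjust accordingly:

    lstmtraining \
      --model_output output/eng_cut \
      --continue_from eng.lstm \
      --append_index 5 --net_spec '[Lfx256 O1c111]' \
      --traineddata tessdata_best/eng.traineddata \
      --train_listfile my_eng.training_files.txt \
      --max_iterations 3000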
- general advice number 3 is to do what the Germans did and train a
dedicated "language", which means you'll need to do all the work of
picking/creating font(s) and text-line training files which include
(hopefully) every word and symbol you may ever encounter later on, and then
cook one CPU or more for a considerable time. I consider that effort
approaching herculean, particularly when you're alone. Some have tried, and
a few even succeeded, it seems, judging from the noises I recall from my
last couple of years of lurking on this mailing list.
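These days https://github.com/tesseract-ocr/tesstrain wraps most of that
work in a Makefile; very roughly (the variable names are real, the paths
and numbers are placeholders of mine):

    git clone https://github.com/tesseract-ocr/tesstrain
    cd tesstrain
    # ground truth goes into data/mymodel-ground-truth/ as pairs of
    # single-line images (.tif/.png) plus transcriptions (.gt.txt)
    make training \
      MODEL_NAME=mymodel \
      START_MODEL=eng \
      TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
      MAX_ITERATIONS=100000

Even warm-starting from eng, expect this to cook a CPU for quite a while.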
By now you'll surely have gotten the gist of it: from the distance of a
mailing-list POV, it's all guesswork, and there are so many little details
involved in arriving at success that almost nobody dares venture saying
much, at least not all at once. Because this stuff is *hard* to get right,
and the above can scare some folks off.
Me personally, I tried my hand at "tuning" a little about a year ago and it
didn't fare well, because I found out I still didn't understand all the
processes involved well enough to make decisions that amounted to more than
a blindfolded crap shoot. But that is me, and I am not into the adrenalin
rush of bungee jumping either, so it probably says more about me than about
the process of training/tuning tesseract.
Having mentioned the above three options, my personal favorite, advice
number 4, is: try to come up with a way that keeps tesseract as-is and add
a review/correction post-process that is acceptable to you. If you can find
it in your heart to accept that a little copy-editing after the OCR run is
A-okay, you are probably better off, both in time spent and in frustration
with the ways of machines. After all, the initial setup cost for this
option is much lower for single-person shops, I expect. ;-) (The break-even
point would be at a fairly large number of pages to process...)
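For instance (my own quick sketch, not an existing tool): map the ASCII
stand-ins the engine tends to emit back to the glyphs you actually wanted,
and eyeball whatever is left over:

    # post_correct.py -- minimal OCR post-correction sketch (my idea, not
    # a tesseract feature); the substitution table is illustrative and
    # deliberately conservative -- extend it to taste.
    import re

    SUBSTITUTIONS = [
        (r"\(C\)", "\u00a9"),    # (C)  -> copyright sign
        (r"\(TM\)", "\u2122"),   # (TM) -> trade mark sign
        (r"\b1/8\b", "\u215b"),  # 1/8  -> vulgar fraction one eighth
        (r"\b2/3\b", "\u2154"),  # 2/3  -> vulgar fraction two thirds
        (r"\b1/2\b", "\u00bd"),
        (r"\b3/4\b", "\u00be"),
    ]

    def post_correct(text):
        for pattern, replacement in SUBSTITUTIONS:
            text = re.sub(pattern, replacement, text)
        return text

    if __name__ == "__main__":
        print(post_correct("(C) 2024 Acme(TM): add 2/3 cup, then 1/8 tsp."))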
- "I've got a mostly English language set of scans (image quality is good
but not great, but best I can do without a better scanner"
Personal experience to date: image preprocessing is a "field of active
research" (i.e. you need to try and test all your own and any others' ideas
that sound more or less reasonable), and it has a very strong effect on the
outcome of the OCR stage. For instance, you may want to rescale your
scanned images and see at which text pixel height they do well/best;
previous research says text at 30-33 pixels height is optimal, but yours
might differ a little from that, so experiment! (I'll try a tesseract run
on an image you posted earlier, tomorrow, at various resize factors, to see
what comes out of that one.)
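If you want to script that experiment, here is a minimal sketch (assuming
Pillow; measured_text_px is *your* estimate of the current character
height, e.g. eyeballed in an image viewer):

    # rescale_for_ocr.py -- rescale a scan so the body text lands near
    # the reported ~30-33 px sweet spot (see the link near the end of
    # this mail)
    from PIL import Image

    def rescale_for_ocr(path, measured_text_px, target_px=31.0):
        img = Image.open(path)
        factor = target_px / measured_text_px
        new_size = (round(img.width * factor), round(img.height * factor))
        # LANCZOS is a decent default filter for rescaling scans
        return img.resize(new_size, Image.LANCZOS)

    # example: text measured at ~22 px tall, upscaled toward 31 px
    # rescale_for_ocr("page_042.png", 22).save("page_042_x31.png")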
Ditto for post-processing: it might be useful, if the content is important
enough to you, to dump the output into a word processor / text editor with
a spellchecker on board for further assistance. A manual review process of
some kind is called for anyway, if you want consistently (very) high
quality output.
There are also processors/tools that can do "smart quotes" if you like, but
I would reserve that for last; my initial approach there would be to have
the OCR engine spit out plain quotes wherever they occur and then convert
them to "smart" open/close quotes in post, if I wanted them. French quotes
(« ») would potentially be easier to OCR directly (as they appear at
different vertical offsets), but I'd be glad to have *any* kind of quote
coming out of the OCR machine: the training sets have been trained on a
gazillion fonts, and intricate little typography details like "smart
quotes" are rather font specific, so recognizing them from an OCR engine's
perspective screams "tuning! dedicated font training!" and a little
headache starts to develop over here. ;-))
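For completeness, the "convert in post" idea in its most naive form (my
sketch; simple open/close heuristics that will get nested and odd cases
wrong, so keep a manual review pass):

    # smarten_quotes.py -- naive straight-to-curly quote conversion sketch
    import re

    def smarten_quotes(text):
        # opening double quotes: at line start or after whitespace/brackets
        text = re.sub(r'(^|(?<=[\s(\[{]))"', '\u201c', text, flags=re.M)
        text = text.replace('"', '\u201d')  # whatever remains, closes
        # apostrophes inside words: it's, don't, y'all
        text = re.sub(r"(?<=\w)'(?=\w)", '\u2019', text)
        text = re.sub(r"(^|(?<=[\s(\[{]))'", '\u2018', text, flags=re.M)
        text = text.replace("'", '\u2019')
        return text

    # print(smarten_quotes('She said "don\'t worry" twice.'))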
- "Slightly related, how, exactly, do y'all deal with drop caps?"
Errrrm, AFAICT.... we don't. Apologies. Seriously though, I don't
recall any positive success info on that one.
Here my initial gut response is to "recognize" the drop caps in a
preprocessor, i.e. in the "image segmentation" phase, and cut them out
specifically so they get extracted, rescaled to a sensible "regular text
size" and only then fed into the OCR engine. Afterwards, that output has to
be recombined with the text produced from the rest of the image segments.
BUT that is mere theory, as tesseract does not yet have a
module/subprocess to "identify" possible drop caps and to segment and
process them as I just described. Which means that today, you either do
that up front and do the recombining afterwards in your own custom
postprocess, or you decide to accept a little extra editorial post-work by
either keeping the drop caps in as-is (and expecting errors, or at least
uncertainties reported by the OCR engine) or maybe tipp-ex-ing ;-) them out
in preprocessing and hoping the engine's built-in dictionary resolves half
of the affected words via spelling correction. Anyway, this is all
currently non-existent, alas, so anything you come up with is better than
what is, today.
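If you do want to try the cut-out-and-recombine route yourself, here is a
back-of-the-envelope sketch (assuming Pillow + pytesseract; the cap_box
bounding box is something *you* must supply, from manual inspection or your
own segmentation -- tesseract won't find it for you today):

    # drop_cap_ocr.py -- sketch of the manual drop-cap workaround above
    from PIL import Image
    import pytesseract

    def ocr_with_drop_cap(path, cap_box):
        page = Image.open(path).convert("RGB")
        # 1) OCR the drop cap alone, shrunk toward body-text size,
        #    treated as a single character (--psm 10); the //3 assumes
        #    a roughly three-line-tall cap -- adjust for your pages
        cap = page.crop(cap_box)
        cap = cap.resize((max(1, cap.width // 3), max(1, cap.height // 3)),
                         Image.LANCZOS)
        cap_text = pytesseract.image_to_string(cap, config="--psm 10").strip()
        # 2) blank the cap on the page so it cannot confuse layout
        #    analysis, then OCR the remainder as usual
        page.paste((255, 255, 255), cap_box)
        body_text = pytesseract.image_to_string(page)
        # 3) crude recombination: glue the cap onto the front of the body
        return cap_text + body_text.lstrip()

    # text = ocr_with_drop_cap("chapter_one.png", (40, 120, 220, 330))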
(I am working on my own copy of tesseract which might improve this a
little, but don't expect any miracles there this quarter. I'm /slow/.)
The 'tesseract does best with 30-33 pixel high text' stuff is at:
https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
I wrote
https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ
a while ago; maybe the diagram and some of the paragraphs in there aid in
understanding what's going on under the hood -- info which I think you
need, like I did/do.
Take care,
Ger
P.S.: it was lying around here for a gander, but my own tesseract build is
buggered ATM. Anyway, I installed an "official distro" build yesterday for
other purposes, and I'll see how your previously posted scans fare with
that one when I test a few things on them. To be reported later this week,
possibly tomorrow afternoon.
On Monday, May 20, 2024 at 5:02:24 AM UTC+2 [email protected] wrote:
> I've asked a couple different times, and each time I get just a little bit
> more information, but still not enough to work with.
>
> I've got a mostly English language set of scans (image quality is good but
> not great, but best I can do without a better scanner, I'm working on that
> to re-scan but there are some problems that still wouldn't be fixed). These
> scans include characters that are not in the Latin-1 block, which I read
> somewhere and now can't find is the limit for the English data. Example
> characters not being recognized include fractions ( ⅛ ⅔ instead of 1/8 or
> 2/3), the TM ( ™ ) or C ( © ) symbols (latter is actually in Latin 1, but
> isn't directly typeable and, from what I've been able to tell, the circled
> part comes out so faint on the input image, tesseract thinks it is noise)
> and "smart" or curly quotes - all characters that require using alt+ codes,
> insert special character dialogs or letting your wordprocessor/DTP handle
> converting for you. Which seems to mean they require some level of manual
> review and correction to be able to get it into the text output. BUT, once
> you see you need to input manually, how do you handle the positioning data
> (when working in hocr format)? I considered, briefly, using character
> whitelisting to help with these, but, that would imply the characters are
> already included in the character set/wordlist, which if memory serves,
> many of these aren't?
>
> Slightly related, how, exactly, do y'all deal with drop caps?
>