- "These scans include characters that are not in the Latin-1 block, which
I read somewhere and now can't find is the limit for the English data."
Well, to put it bluntly, diving into the rabbit hole without a helmet or a
'chute: as far as I have been able to discover, the current "official"
tesseract traineddata files (neural net matrices) that are used to
recognize anything we throw at tesseract were produced ("trained") at
Google by Ray Smith, using copious Google hardware, I expect --
training neural nets is no joy on the average Joe's hardware budget, after
all. When you dig through the git commits, such as
https://github.com/tesseract-ocr/tessdata/commits/main/ , you'll find the
last training file *content* update was back in '17 by @theraysmith, and he
hasn't been around much since:
https://github.com/theraysmith?tab=overview&from=2017-12-01&to=2017-12-31
-- without any hard data, my initial guess is a change of corporate mind at
Google regarding tesseract.
Stefan Weil et al. have done a ton of important work since, but when
you ask "what can this baby recognize?", that translates 1:1 to "what has
tesseract been trained to recognize?", and there... things get a little
vague for me. I'd love to be corrected on this, slapped on the wrist or
worse, but from what I've gleaned so far during my research:
- though there's https://github.com/tesseract-ocr/langdata
, https://github.com/tesseract-ocr/tesstrain
, https://github.com/tesseract-ocr/tessdata_best/commits/main/ and Ray
Smith's public notes and papers about what was done for tesseract v4/v5
at https://github.com/tesseract-ocr/docs (which is separate
from https://github.com/tesseract-ocr/tessdoc, which is user oriented
rather than architectural background), I am not confident that the actual
list of training files used to produce those master traineddata LSTM files
(= the tesseract v4/v5 OCR engine) is checked into git: I have seen a list
of font names used some place in there (or was it the mailing list?), but
for anyone who works with fonts that already is a handwavey kind of thing
and, yes, copyrights, yadayada, will forever prevent anything more precise
from becoming available, because the list most certainly included
commercial fonts.
Then there are also the training input files defining the "text lines" to be
rendered as training material: those actually determine which glyphs in the
fonts will be trained at all (and in what combinations). And there I am not
feeling confident either, as it looks like the files that were published are
the ones from the older v3 engine: still relevant, but *probably* not what
Ray was using to produce the many traineddata files he did at the Google shop.
Having dug through the git histories and inspected the various files,
scripts and notes about 2 years ago, I cannot say with complete confidence
whether your (C), TM and 1/2, 3/4, etc. fraction glyphs made it into the
English training set back then. My *guess* is that they were included, if
only as a few samples, so the neural net will have *some* recollection of
them, but I also expect them to have featured little in the total training
process, so recognition chances are reduced.
(Aside: As we focus on the English language training set here, I didn't
mention the metric ton of work done by @Shreeshrii for Asian scripts,
particularly Devanagari and related, a few years later. As far as I can
tell, most of the `traineddata` scripts and process today are due to
Shreeshrii's and Stefan Weil's work. Stefan, if you look over there, has
done a lot of work around OCR-ing (pre-war) German newspapers and similar
publications, from the era when the Germans had a fondness for printing
everything in (to my eyes) quite hard-to-read blackletter fonts. To make
that feat happen, he and a team from several German universities (if I read
right what was done back then) created a German-specific training set for
newspaper blackletter print and published the resulting tesseract
traineddata OCR databases for public use (the Fraktur models go by language
codes like "frk" and "deu_frak"). I don't recall seeing a publication where
he lists the number of CPU hours used to produce that trained set (one (1)
language and a few fonts, vs. the 400+ fonts allegedly used in the Google
production run), but you can bet your bottom dollar it wasn't cheap! Or
quick!)
When we pop out of the rabbit hole of tesseract history, we might now
better understand why your problem is answered... haphazardly:
- general advice number 1 out there is to 'tune' an existing language
traineddata file if you have special needs, such as your wish to recognize
fractions, etc., which don't feature often in published texts and thus
haven't been a real bother thus far. This "tuning" advice is basically
advice to do a little extra training, which is, to me, a little hairy, as
you are expected to *not* deteriorate the existing recognition ability
while *slightly improving* the recognition confidence (and thus output
quality) for a few glyphs ("characters in your fonts") that are already
mostly recognized by the neural net, in the sense that it already
recognizes part or all of the relevant "shapes" that make up the glyphs you
wish to see recognized. (This is a very rough translation of what a neural
net "learns" vs. how we humans might understand pattern recognition, so
tread carefully around this blather of mine if you think you're getting a
look under the hood. We're rather *paraphrasing* the engine instead of
pointing at its carburetor, spark plugs, etc., if you get my drift.)
Logically, this approach is met with varying success (and crushed hopes),
as it is VERY much dependent on the exact shapes and glyphs (characters)
you add. (TM) might be helped by being quite close to a superscripted T+M,
while the fractions, being a combo of superscript, subscript and a / slash,
might be doable or hard for the LSTM+CTC engine; I cannot tell without
having tried. And training takes time, both in setting it up and in CPU
cycles, so it's not a 5-minute thing to do. Which explains another type of
silence around here.
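To make that concrete: the fine-tuning recipe from the official training
docs looks roughly like the below. Treat it as a sketch from memory, not
gospel: the paths, file names and iteration counts are placeholders of
mine, and you need the training tools built plus your own list of .lstmf
training files first.

    # pull the LSTM model out of the stock best-quality traineddata
    combine_tessdata -e tessdata_best/eng.traineddata eng.lstm

    # fine-tune for a modest number of iterations on your own material
    lstmtraining \
      --model_output output/eng_tuned \
      --continue_from eng.lstm \
      --traineddata tessdata_best/eng.traineddata \
      --train_listfile my_eng.training_files.txt \
      --max_iterations 400

    # bake the checkpoint back into a traineddata tesseract can load
    lstmtraining --stop_training \
      --continue_from output/eng_tuned_checkpoint \
      --traineddata tessdata_best/eng.traineddata \
      --model_output output/eng_tuned.traineddata

Note you must start from the tessdata_best variants: the fast/integerized
models cannot be trained further.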
- if that didn't work, you will read several folks advising to "lop off the
top layer" and retrain the whole language. What this means is that,
basically, you wipe just one of the many layers of the LSTM+CTC neural net,
the one where it is expected to 'conclude' things like "ah... that there
and this shapy thingamajig here, all that jazz is very probably an
'a'...", and you hope that that lopping-off-and-retraining suffices to get
acceptable training results after running the training for a while (and
checking that you're doing all right and not overtraining other bits and
pieces of the engine's alphabet/text output!).
This takes rather more time than "tuning", as you must now retrain at
least an entire layer, while tuning was only intended to have the training
activity tweak a few cell connections in there a little to get what you
wanted.
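Again paraphrasing the official docs from memory: you keep the network up
to a given layer index, append a fresh top layer plus output layer via a
VGSL net_spec, and retrain. The --append_index 5 and '[Lfx256 O1c111]'
below mirror the docs' example; the 111 in O1c111 is the number of output
classes and must match *your* unicharset size, so adjust accordingly:

    lstmtraining \
      --model_output output/eng_cut \
      --continue_from eng.lstm \
      --append_index 5 --net_spec '[Lfx256 O1c111]' \
      --traineddata tessdata_best/eng.traineddata \
      --train_listfile my_eng.training_files.txt \
      --max_iterations 3000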
- general advice number 3 is to do what the Germans did and train a
dedicated "language", which means you'll need to do all the work of
picking/creating font(s) and text-line training files which include
(hopefully) every word and symbol you may ever encounter later on, and then
cook one CPU or more for a considerable time. I consider that effort
approaching herculean, particularly when you're alone. Some have tried, and
a few even succeeded, it seems, judging from the noises I recall from my
last couple of years of lurking on this mailing list.
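These days https://github.com/tesseract-ocr/tesstrain wraps most of that
work in a Makefile; very roughly (the variable names are real, the paths
and numbers are placeholders of mine):

    git clone https://github.com/tesseract-ocr/tesstrain
    cd tesstrain
    # ground truth goes into data/mymodel-ground-truth/ as pairs of
    # single-line images (.tif/.png) plus transcriptions (.gt.txt)
    make training \
      MODEL_NAME=mymodel \
      START_MODEL=eng \
      TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
      MAX_ITERATIONS=100000

Even warm-starting from eng, expect this to cook a CPU for quite a while.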
By now you'll surely have gotten the gist of it: from the distance of a
mailing-list POV, it's all guesswork, and there are so many little details
involved in arriving at success that almost nobody dares venture saying
much, at least not all at once. Because this stuff is *hard* to get right,
and the above can scare some folks off.
Me personally, I tried my hand at "tuning" a little about a year ago and it
didn't fare well, because I found out I still didn't understand all the
processes involved well enough to make decisions that amounted to more than
a blindfolded crap shoot. But that is me, and I am not into the adrenalin
rush of bungee jumping either, so it probably says more about me than about
the process of training/tuning tesseract.
Having mentioned the above three options, my personal favorite, advice
number 4, is: try to come up with a way that keeps tesseract as-is and add
a review/correction post-process that is acceptable to you. If you can find
it in your heart to accept that a little copy-editing after the OCR run is
A-okay, you are probably better off, both in time spent and in frustration
with the ways of machines. After all, the initial setup cost for this
option is much lower for single-person shops, I expect. ;-) (The break-even
point would be at a fairly large number of pages to process...)
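For instance (my own quick sketch, not an existing tool): map the ASCII
stand-ins the engine tends to emit back to the glyphs you actually wanted,
and eyeball whatever is left over:

    # post_correct.py -- minimal OCR post-correction sketch (my idea, not
    # a tesseract feature); the substitution table is illustrative and
    # deliberately conservative -- extend it to taste.
    import re

    SUBSTITUTIONS = [
        (r"\(C\)", "\u00a9"),    # (C)  -> copyright sign
        (r"\(TM\)", "\u2122"),   # (TM) -> trade mark sign
        (r"\b1/8\b", "\u215b"),  # 1/8  -> vulgar fraction one eighth
        (r"\b2/3\b", "\u2154"),  # 2/3  -> vulgar fraction two thirds
        (r"\b1/2\b", "\u00bd"),
        (r"\b3/4\b", "\u00be"),
    ]

    def post_correct(text):
        for pattern, replacement in SUBSTITUTIONS:
            text = re.sub(pattern, replacement, text)
        return text

    if __name__ == "__main__":
        print(post_correct("(C) 2024 Acme(TM): add 2/3 cup, then 1/8 tsp."))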
- "I've got a mostly English language set of scans (image quality is good
but not great, but best I can do without a better scanner"
Personal experience to date: image preprocessing is a "field of active
research" (i.e. you need to try and test all your own and any others' ideas
that sound more or less reasonable), and it has a very strong effect on the
outcome of the OCR stage. For instance, you may want to rescale your
scanned images and see at which text pixel height they do well/best;
previous research says text at 30-33 pixels height is optimal, but yours
might differ a little from that, so experiment! (I'll try a tesseract run
on an image you posted earlier, tomorrow, at various resize factors, to see
what comes out of that one.)
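If you want to script that experiment, here is a minimal sketch (assuming
Pillow; measured_text_px is *your* estimate of the current character
height, e.g. eyeballed in an image viewer):

    # rescale_for_ocr.py -- rescale a scan so the body text lands near
    # the reported ~30-33 px sweet spot (see the link near the end of
    # this mail)
    from PIL import Image

    def rescale_for_ocr(path, measured_text_px, target_px=31.0):
        img = Image.open(path)
        factor = target_px / measured_text_px
        new_size = (round(img.width * factor), round(img.height * factor))
        # LANCZOS is a decent default filter for rescaling scans
        return img.resize(new_size, Image.LANCZOS)

    # example: text measured at ~22 px tall, upscaled toward 31 px
    # rescale_for_ocr("page_042.png", 22).save("page_042_x31.png")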
Ditto for post-processing: it might be useful, if the content is important
enough to you, to dump the output into a word processor / text editor with
a spellchecker on board for further assistance. A manual review process of
some kind is called for anyway, if you want consistently (very) high
quality output.
There are also processors/tools that can do "smart quotes" if you like, but
I would reserve that for last; my initial approach there would be to have
the OCR engine spit out plain quotes wherever they occur and then convert
them to "smart" open/close quotes in post, if I wanted them. French quotes
(« ») would potentially be easier to OCR directly (as they appear at
different vertical offsets), but I'd be glad to have *any* kind of quote
coming out of the OCR machine: the training sets have been trained on a
gazillion fonts, and intricate little typography details like "smart
quotes" are rather font specific, so recognizing them from an OCR engine's
perspective screams "tuning! dedicated font training!" and a little
headache starts to develop over here. ;-))
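For completeness, the "convert in post" idea in its most naive form (my
sketch; simple open/close heuristics that will get nested and odd cases
wrong, so keep a manual review pass):

    # smarten_quotes.py -- naive straight-to-curly quote conversion sketch
    import re

    def smarten_quotes(text):
        # opening double quotes: at line start or after whitespace/brackets
        text = re.sub(r'(^|(?<=[\s(\[{]))"', '\u201c', text, flags=re.M)
        text = text.replace('"', '\u201d')  # whatever remains, closes
        # apostrophes inside words: it's, don't, y'all
        text = re.sub(r"(?<=\w)'(?=\w)", '\u2019', text)
        text = re.sub(r"(^|(?<=[\s(\[{]))'", '\u2018', text, flags=re.M)
        text = text.replace("'", '\u2019')
        return text

    # print(smarten_quotes('She said "don\'t worry" twice.'))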
- "Slightly related, how, exactly, do y'all deal with drop caps?"
Errrrm, AFAICT.... we don't. Apologies. Seriously though, I don't
recall any positive success info on that one.
Here my initial gut response is to "recognize" the drop caps in a
preprocessor, i.e. in the "image segmentation" phase, and cut them out
specifically so they get extracted, rescaled to a sensible "regular text
size" and only then fed into the OCR engine. Afterwards, that output has to
be recombined with the text produced from the rest of the image segments.
BUT that is mere theory, as tesseract does not yet have a
module/subprocess to "identify" possible drop caps and to segment and
process them as I just described. Which means that today, you either do
that up front and do the recombining afterwards in your own custom
postprocess, or you decide to accept a little extra editorial post-work by
either keeping the drop caps in as-is (and expecting errors, or at least
uncertainties reported by the OCR engine) or maybe tipp-ex-ing ;-) them out
in preprocessing and hoping the engine's built-in dictionary resolves half
of the affected words via spelling correction. Anyway, this is all
currently non-existent, alas, so anything you come up with is better than
what is, today.
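If you do want to try the cut-out-and-recombine route yourself, here is a
back-of-the-envelope sketch (assuming Pillow + pytesseract; the cap_box
bounding box is something *you* must supply, from manual inspection or your
own segmentation -- tesseract won't find it for you today):

    # drop_cap_ocr.py -- sketch of the manual drop-cap workaround above
    from PIL import Image
    import pytesseract

    def ocr_with_drop_cap(path, cap_box):
        page = Image.open(path).convert("RGB")
        # 1) OCR the drop cap alone, shrunk toward body-text size,
        #    treated as a single character (--psm 10); the //3 assumes
        #    a roughly three-line-tall cap -- adjust for your pages
        cap = page.crop(cap_box)
        cap = cap.resize((max(1, cap.width // 3), max(1, cap.height // 3)),
                         Image.LANCZOS)
        cap_text = pytesseract.image_to_string(cap, config="--psm 10").strip()
        # 2) blank the cap on the page so it cannot confuse layout
        #    analysis, then OCR the remainder as usual
        page.paste((255, 255, 255), cap_box)
        body_text = pytesseract.image_to_string(page)
        # 3) crude recombination: glue the cap onto the front of the body
        return cap_text + body_text.lstrip()

    # text = ocr_with_drop_cap("chapter_one.png", (40, 120, 220, 330))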
(I am working on my own copy of tesseract which might improve this a
little, but don't expect any miracles there this quarter. I'm /slow/.)
The 'tesseract does best with 30-33 pixel high text' stuff is at:
https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
I wrote
https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ
a while ago; maybe the diagram and some of the paragraphs in there aid in
understanding what's going on under the hood -- info which I think you
need, like I did/do.
Take care,
Ger
P.S.: it was lying around here for a gander, but my own tesseract build is
buggered ATM. Anyway, I installed an "official distro" build yesterday for
other purposes, and I'll see how your previously posted scans fare with
that one when I test a few things on them. To be reported later this week,
possibly tomorrow afternoon.
On Monday, May 20, 2024 at 5:02:24 AM UTC+2 [email protected] wrote:
> I've asked a couple different times, and each time I get just a little bit
> more information, but still not enough to work with.
>
> I've got a mostly English language set of scans (image quality is good but
> not great, but best I can do without a better scanner, I'm working on that
> to re-scan but there are some problems that still wouldn't be fixed). These
> scans include characters that are not in the Latin-1 block, which I read
> somewhere and now can't find is the limit for the English data. Example
> characters not being recognized include fractions ( ⅛ ⅔ instead of 1/8 or
> 2/3), the TM ( ™ ) or C ( © ) symbols (latter is actually in Latin 1, but
> isn't directly typeable and, from what I've been able to tell, the circled
> part comes out so faint on the input image, tesseract thinks it is noise)
> and "smart" or curly quotes - all characters that require using alt+ codes,
> insert special character dialogs or letting your wordprocessor/DTP handle
> converting for you. Which seems to mean they require some level of manual
> review and correction to be able to get it into the text output. BUT, once
> you see you need to input manually, how do you handle the positioning data
> (when working in hocr format)? I considered, briefly, using character
> whitelisting to help with these, but, that would imply the characters are
> already included in the character set/wordlist, which if memory serves,
> many of these aren't?
>
> Slightly related, how, exactly, do y'all deal with drop caps?
>