On Thu, 15 Feb 2024, 17:06 Ger Hobbelt, <g...@hobbelt.com> wrote:

> Re "X" checkbox:
>
>
More shorthand examples in your "input language":

Tabl.  = tablet (pill)
tägl   = täglich (German: daily dosage)


I mention these extra examples (visible in the scanned images) because I
find people generally have a hard time wrapping their heads around the word
"language" as used in CS, where it is jargon: a "language" is both the
structure and all the "words" (vocabulary) you use. As such, "tägl",
"tabl", etc. are simply more plain *words* in the language used here.
A machine doesn't know or care about the human cleverness that goes into
constructing shorthand or acronyms. For a recognizer, they are just more
words, and ones that are harder to recognize correctly because they have
higher entropy (less internal structure) than the other, more usual, words
in the language used.

Tesseract, like any OCR engine, recognizes a trained (CS jargon!) *language*.
If a *word*, which you and I may call a shorthand or otherwise, did not
feature in the training set, then the "hidden Markov model"-like part of the
engine will rank the raw initial pattern recognition result a bit or a lot
lower, depending on circumstances. Thus you will observe lower scores
for untrained jargon or for regular *wirds* with typos in them, such as the
"*wirds*" just now. (The engine would like to read "words" or "wards", but
"wurds" and "wirds" are unlisted and hence errors as far as the English
language model is concerned. So while it may well recognize the word
correctly as "wirds", it will surely give it a (slightly?) lowered score.)

Acronyms, for example "YMMV", are, from a Markov chain / machine
perspective, completely *nuts*, as there's no other word in the English
language dictionary that contains the "mmv" triple-consonant combo. Hence
any recognizer must be explicitly trained to recognize it, by including it
in the training dictionary, and by now you'll realize it will require
additional training rounds due to the weirdness of having "mmv" in there,
plus the moderately rare "(SOW)Y" starter ((SOW) = start-of-word edge
marker): "you", "yoghurt", "ypsilon", ... The Y section in your old printed
dictionary wasn't all that large either, but it's common enough to have
been picked up during training. The "mmv" will kill it, score-wise, if
"YMMV" wasn't in the training set. (What OCR system designers do is pass
such stuff along with severely lowered scores, marking it as
doubtful/untrustworthy/WTF, which I dramatise as "killing it".)
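
To make the Markov chain intuition concrete, here is a toy sketch in
Python. It is NOT how tesseract works internally: it is a plain character
trigram model with add-one smoothing, trained on a tiny made-up word list
(TRAINING_WORDS, train, score and MODEL are all names invented for this
illustration). It only shows why letter sequences that never occurred in
training drag a word's score down.

```python
import math
from collections import Counter

# Tiny stand-in for a training vocabulary (an assumption for illustration;
# a real language model is trained on vastly more text).
TRAINING_WORDS = [
    "words", "wards", "world", "work", "word", "sword",
    "you", "yoghurt", "your", "year", "tablet", "daily",
]

# 26 letters plus the end-of-word marker "$" can follow any context.
ALPHABET_SIZE = 27


def trigrams(word):
    """Split a word into character trigrams, with "^^" as the (SOW)
    start-of-word marker and "$" as the end-of-word marker."""
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(words):
    """Count trigrams and their two-character contexts."""
    tri_counts, ctx_counts = Counter(), Counter()
    for w in words:
        for t in trigrams(w):
            tri_counts[t] += 1
            ctx_counts[t[:2]] += 1
    return tri_counts, ctx_counts


def score(word, model):
    """Average log-probability per trigram (higher = more 'expected').
    Add-one smoothing keeps unseen trigrams from scoring minus infinity."""
    tri_counts, ctx_counts = model
    grams = trigrams(word)
    total = 0.0
    for t in grams:
        p = (tri_counts[t] + 1) / (ctx_counts[t[:2]] + ALPHABET_SIZE)
        total += math.log(p)
    return total / len(grams)


MODEL = train(TRAINING_WORDS)

if __name__ == "__main__":
    # List the candidates from most to least expected by the toy model.
    for w in sorted(["words", "wirds", "ymmv"],
                    key=lambda w: score(w, MODEL), reverse=True):
        print(f"{w:6s} {score(w, MODEL):.3f}")
```

With this word list, "words" scores highest, "wirds" lower (the "wi",
"ir" transitions were never seen), and "ymmv" lowest: the familiar "y"
start helps a little, but "ymm", "mmv" and "mv$" are all unseen.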

Ditto for your (German and semi-numerical) shorthands: "3x" for three times
was hopefully part of the trained language model. I haven't checked, so I
don't know.

Anyway, if that word score drops too low, tesseract decides not to list the
word at all in its output. Lots of folks arriving at this mailing list
suffer from that fundamental issue: lower scores and output *silence* due
to feeding tesseract "*wirds*" that do not exist in the chosen models'
training sets, such as product SKUs. The issue is often compounded by other
score-decreasing circumstances, for nothing is truly easy here.
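
If you want to see those scores for yourself, here is a minimal sketch,
assuming you drive tesseract from Python via pytesseract. The helper
names, the "deu" language choice and the threshold of 60 are arbitrary
picks for illustration, not recommended values. Note the limitation:
this can only show words tesseract still emitted with a low confidence;
words it dropped entirely will simply be absent.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (text, confidence) pairs scoring below `threshold`.

    `data` is a dict of parallel lists with "text" and "conf" keys, as
    returned by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        if conf < 0 or not text.strip():
            continue  # layout rows (page/block/line) have conf -1, no text
        if conf < threshold:
            flagged.append((text, conf))
    return flagged


def ocr_word_data(image_path, lang="deu"):
    """Run tesseract on an image and return its per-word data dict.
    Needs pytesseract, Pillow and a tesseract install with `lang` data."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_data(
        Image.open(image_path), lang=lang,
        output_type=pytesseract.Output.DICT,
    )


if __name__ == "__main__":
    import sys

    # Usage: python lowconf.py scan.png
    for text, conf in low_confidence_words(ocr_word_data(sys.argv[1])):
        print(f"{conf:5.1f}  {text}")
```

Comparing that list against what you can read in the scan yourself tells
you which shorthand made it through with a poor score and which went
missing altogether.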

Modern high-grade recognizers all have implicitly embedded Markov models
(think: trained dictionaries plus word stems and word endings; this
mental model is off, but close enough for initial comprehension). So you
cannot "switch off / disable" the language dictionary for tesseract v4/v5
models like you could/can for the old v3 ones (which obviously do worse in
general), and consequently you cannot prevent the engine from "downgrading"
shorthand and other words that were unknown at the training phase.

A corollary: this is why medical and legal recognizers for speech-to-text
and print-to-text are highly specialized and dedicated endeavours which
come at a steep price. The consequences of an *additional* mistake are very
expensive, in all regards, not just liability, but also ethically and .......




