On Thu, 15 Feb 2024, 17:06 Ger Hobbelt, <g...@hobbelt.com> wrote:

Re "X" checkbox:

More shorthand examples in your "input language":
Tabl. = tablet (pill)
tägl = täglich (German: daily dosage)

I mention these extra examples (visible in the scanned images) because I find that people generally have a hard time wrapping their heads around the CS word "language", as it is CS-specific jargon: a "language" is both the structure and all the "words" (vocabulary) you use. As such, "tägl", "tabl", etc. are just so many more plain *words* in the language used here.

A machine doesn't know or care about the human smartness that goes into constructing shorthand or acronyms. For a recognizer, these are basically just more words, and ones that are that much harder to recognize correctly because they have increased entropy (less internal structure) compared to the other, more usual, words in the language used.

Tesseract, like any OCR engine, recognizes a trained (CS jargon!) *language*. If a *word*, which you and I may call a shorthand or otherwise, did not feature in the training set, then the "hidden Markov model"-simile in the engine will rank the raw initial pattern recognition result a bit or a lot lower, depending on circumstances. Thus you will observe lower scores for untrained jargon or for regular *wirds* with typos in them, such as the "*wirds*" just now. (It would like to read "words" or "wards", but "wurds" and "wirds" are unlisted and hence count as English language errors; so while the engine may well recognize the word correctly as "wirds", it will surely give it a (slightly?) lowered score.)

Acronyms, for example "YMMV", are, from a Markov chain / machine perspective, completely *nuts*, as there's no other word in the English language dictionary that contains the "mmv" triple-consonant combo. Hence any recognizer must be explicitly trained to recognize it by including it in the training dictionary, and by now you'll realize it will require additional training rounds due to the weirdness of having "mmv" in there, plus the moderately rare "(SOW)Y" starter ((SOW) = start-of-word edge marker): "you", "yoghurt", "ypsilon", ...
The Y section in your old printed dictionary wasn't all that large either, but it's common enough to have been picked up during training. The "mmv" will kill it, score-wise, if "YMMV" wasn't in the training set. (What OCR system designers do is pass such stuff along with severely lowered scores, marking it as doubtful/untrustworthy/WTF, which I dramatise as "killing it".)

Ditto for your (German and semi-numerical) shorthands: "3x" for "three times" was hopefully part of the trained language model. I haven't checked, so I don't know. Anyway, if that word score drops too low, tesseract decides not to list the word at all in its output.

Lots of folks entering this mailing list suffer from that fundamental issue: lower scores and output *silence* due to feeding tesseract "*wirds*" that do not exist in the chosen models' training sets, such as product SKUs. The issue is often compounded by other score-decreasing circumstances, for nothing is truly easy here.

Modern high-grade recognizers all have implicitly embedded Markov models (think: trained dictionaries plus word stemmings and ~-endings; this thought model is off, but close enough for initial comprehension), so you cannot "switch off / disable" the language dictionary for tesseract v4/v5 models like you could/can for the old v3 ones (which obviously do worse in general). Consequently, you cannot prevent the engine from "downgrading" shorthand and other words unknown at the training phase.

The corollary: this is why medical and legal recognizers for speech-to-text and print-to-text are highly specialized and dedicated endeavours which come at a steep price. The consequences of an *additional* mistake are very expensive in all regards, not just in liability but also ethically and .......
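To make the "Markov chain" point above a bit more tangible, here is a minimal toy sketch of a character-trigram model with add-one smoothing. It is NOT how tesseract scores words internally (its LSTM models embed this kind of knowledge implicitly); the tiny training list, the "^" / "$" edge markers and the function names are all assumptions made up for illustration. It only shows why a word built from unseen letter combinations, such as "mmv", ends up with a lower score than a known word or a near-miss typo.

```python
import math
import string
from collections import Counter

# Assumed symbol set: lowercase letters plus "$" as a made-up end-of-word marker.
ALPHABET = string.ascii_lowercase + "$"

# Hypothetical, deliberately tiny "training dictionary".
TRAINING_WORDS = [
    "words", "wards", "word", "ward", "you", "your",
    "yoghurt", "year", "daily", "tablet", "many", "more",
]


def trigrams(word):
    """Split a word into character trigrams, padded with "^^" (start) and "$" (end)."""
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(words):
    """Count every trigram and every two-character context seen in the training words."""
    tri, ctx = Counter(), Counter()
    for w in words:
        for t in trigrams(w):
            tri[t] += 1
            ctx[t[:2]] += 1
    return tri, ctx


def score(word, model):
    """Mean log-probability per trigram, using add-one (Laplace) smoothing.

    Closer to zero means "looks more like the training language".
    """
    tri, ctx = model
    logs = [
        math.log((tri[t] + 1) / (ctx[t[:2]] + len(ALPHABET)))
        for t in trigrams(word)
    ]
    return sum(logs) / len(logs)


MODEL = train(TRAINING_WORDS)

if __name__ == "__main__":
    for w in ("words", "wirds", "ymmv"):
        print(f"{w:6s} {score(w, MODEL):.3f}")
```

With this toy data, "words" scores best, the typo "wirds" scores lower (it still shares the "rds" ending with trained words), and "ymmv" scores lowest because almost none of its trigrams were seen. Adding "ymmv" to TRAINING_WORDS raises its score, which is the toy equivalent of including the acronym in the training dictionary.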
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAFP60fp2m5biCjDXgnU4B1XUswYFwszeMZzU-y7ZLabfo67hAQ%40mail.gmail.com.