Thank you for your detailed answer.

g...@hobbelt.com wrote on Thursday, 15 February 2024 at 18:51:05 UTC+1:
> Re tesseract output for "mittag" etc. in your sample: first port of call
> for "cleaning up dot matrix printer" for OCR, i.e. dedicated image
> preprocessing, would be googling
>
> leptonica image morphology, open close expand dilate dot matrix
>
> or some such.
>
> While I would go with using leptonica for that, as tesseract already uses
> the same lib and I'd rather code this in C++ or shell/node, the OpenCV
> documentation for the same math ops is more intuitive to me.
> https://docs.opencv.org/4.x/d9/d61/tutorial_py_morphological_ops.html
>
> This is always finicky stuff so getting the parameters just right is an
> exercise left to the reader today. ;-)
>
> I do recall dot matrix image woes mentioned before in this ML, but it's a
> long while back and a quick search didn't dig up those conversations' hrefs.
>
> On Thu, 15 Feb 2024, 18:18 Ger Hobbelt, <g...@hobbelt.com> wrote:
>
>> On Thu, 15 Feb 2024, 17:06 Ger Hobbelt, <g...@hobbelt.com> wrote:
>>
>>> Re "X" checkbox:
>>
>> More shorthand examples in your "input language":
>>
>> Tabl. = tablet (pill)
>> tägl = täglich (German: daily dosage)
>>
>> I mention these extra examples (visible in the scanned images) as I find
>> generally people have a hard time wrapping their head around the CS
>> "language" word as it is CS-specific jargon: a "language" is both the
>> structure and all the "words" (vocabulary) you use. As such, "tägl",
>> "tabl", etc. are just so many more plain *words* in the language used here.
>> A machine doesn't know or care about human smartness constructing shorthand
>> or acronyms. For a recognizer, it's basically just more words that are just
>> that much harder to recognize correctly as they have increased entropy
>> (less internal structure) compared to the other, more usual, words in the
>> language used.
>>
>> Tesseract, and any OCR engine, recognizes a trained (CS jargon!)
>> *language*.
>> If a *word*, which you and I may call a shorthand or
>> otherwise, did not feature in the training set, then the "hidden Markov
>> model"-simile in the engine will rank the raw initial pattern recognition
>> result a bit or a lot lower, depending on circumstances, and thus you will
>> observe lower scores for untrained jargon or regular *wirds* with typos
>> in them, such as the "*wirds*" just now. (It would like to read "words"
>> or "wards", but "wurds" and "wirds" are unlisted, hence English language
>> errors, and thus, while possibly correctly recognizing it as "wirds", it
>> will surely rate that word a (slightly?) lowered score.)
>>
>> Acronyms, for example "YMMV", are, from a Markov chain / machine
>> perspective, completely *nuts* as there's no other word in the English
>> language dictionary that contains the "mmv" triple-consonant combo. Hence
>> any recognizer must be explicitly trained to recognize it, by including it
>> in the training dictionary, and by now you'll realize it will require
>> additional training rounds due to its weirdness of having "mmv" in there,
>> plus the moderately rare "(SOW)Y" starter ( (SOW) = start of word edge
>> marker): "you", "yoghurt", "ypsilon", ... The Y section in your old printed
>> dictionary wasn't all that large either, but it's common enough to have
>> been picked up during training. The "mmv" will kill it, score-wise, if
>> "YMMV" wasn't in the training set. (What OCR system designers do is pass
>> such stuff along with severely lowered scores, marking it as
>> doubtful/untrustworthy/WTF, which I dramatise as "killing it".)
>>
>> Ditto for your (German and semi-numerical) shorthands: "3x" for three
>> times was hopefully part of the trained language model. I haven't checked,
>> I don't know.
>>
>> Anyway, if that word score drops too low, tesseract decides not to list
>> the word at all in its output.
>> Lots of folks entering this mailing list
>> suffer that fundamental issue: lower scores and output *silence* due to
>> feeding tesseract "*wirds*" that do not exist in the chosen models'
>> training sets, such as product SKUs. The issue is often compounded by other
>> score-decreasing circumstances, for nothing is truly easy here.
>>
>> Modern high grade recognizers all have implicitly embedded Markov models
>> (think: trained dictionaries plus word stemmings and ~-endings; this
>> thought model is off but close enough for initial comprehension) so you
>> cannot "switch off / disable" the language dictionary for tesseract v4/v5
>> models like you could/can for the old v3 ones (which obviously do worse in
>> general) and consequently you cannot prevent the engine from "downgrading"
>> shorthand and other words unknown at the training phase.
>>
>> The corollary of this is: this is why medical and legal recognizers for
>> speech to text and print to text are highly specialized and dedicated
>> endeavours which come at a steep price. Because the consequences of an
>> *additional* mistake are very expensive, in all regards, not just
>> liability, but also ethically and .......
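On the dot matrix clean-up suggested in the quoted message: to get a feel for what the morphological "close" operation does before tuning real parameters, here is a minimal pure-Python sketch on a toy bitmap. Everything in it (the function names, the square neighbourhood, the `radius` parameter, the sample stroke) is my own illustration and not leptonica or OpenCV code; in practice the equivalent library call would be something like OpenCV's `cv2.morphologyEx` with `cv2.MORPH_CLOSE`, as covered in the tutorial linked above.

```python
# Toy binary morphology on a grid of 0/1 pixels (1 = ink).
# Closing = dilation followed by erosion; it bridges small gaps between dots.
# Illustration only: a real pipeline would use leptonica or OpenCV instead.

def _neighbourhood(h, w, y, x, radius):
    """Yield the coordinates of the square window around (y, x), clipped to the image."""
    for ny in range(max(0, y - radius), min(h, y + radius + 1)):
        for nx in range(max(0, x - radius), min(w, x + radius + 1)):
            yield ny, nx

def dilate(img, radius=1):
    """A pixel becomes ink if any pixel in its window is ink."""
    h, w = len(img), len(img[0])
    return [[int(any(img[ny][nx] for ny, nx in _neighbourhood(h, w, y, x, radius)))
             for x in range(w)] for y in range(h)]

def erode(img, radius=1):
    """A pixel stays ink only if every pixel in its (clipped) window is ink."""
    h, w = len(img), len(img[0])
    return [[int(all(img[ny][nx] for ny, nx in _neighbourhood(h, w, y, x, radius)))
             for x in range(w)] for y in range(h)]

def closing(img, radius=1):
    """Dilate, then erode with the same window size."""
    return erode(dilate(img, radius), radius)

# A horizontal "dot matrix" stroke: four separate dots with one-pixel gaps.
rows = [
    "00000000000",
    "00000000000",
    "00101010100",
    "00000000000",
    "00000000000",
]
stroke = [[int(c) for c in row] for row in rows]

closed = closing(stroke, radius=1)
for row in closed:
    print("".join(str(p) for p in row))
# The middle row prints as 00111111100: the dots are joined into one stroke.
```

The window size plays the role of the structuring element: too small and the gaps between dots stay open, too large and neighbouring characters start to merge, which is the finicky parameter tuning mentioned above.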