Re the tesseract output for "mittag" etc. in your sample: the first port of
call for cleaning up dot-matrix printer output for OCR, i.e. dedicated image
preprocessing, would be googling

leptonica image morphology, open close expand dilate dot matrix

or some such.

While I would use leptonica for that, as tesseract already uses the same
lib and I'd rather code this in C++ or shell/node, I find the OpenCV
documentation for the same math ops more intuitive:
https://docs.opencv.org/4.x/d9/d61/tutorial_py_morphological_ops.html

This is always finicky stuff so getting the parameters just right is an
exercise left to the reader today. ;-)

I do recall dot-matrix image woes being mentioned before on this mailing
list, but it's a long while back and a quick search didn't dig up links to
those conversations.



On Thu, 15 Feb 2024, 18:18 Ger Hobbelt, <g...@hobbelt.com> wrote:

>
>
> On Thu, 15 Feb 2024, 17:06 Ger Hobbelt, <g...@hobbelt.com> wrote:
>
>> Re "X" checkbox:
>>
>>
> More shorthand examples in your "input language":
>
> Tabl.  = tablet (pill)
> tägl   = täglich (German: daily dosage)
>
>
> I mention these extra examples (visible in the scanned images) as I find
> people generally have a hard time wrapping their heads around the word
> "language" as CS-specific jargon: a "language" is both the structure and
> all the "words" (vocabulary) you use. As such, "tägl", "tabl", etc. are
> just so many more plain *words* in the language used here. A machine
> doesn't know or care about the human smarts that went into constructing
> shorthand or acronyms. For a recognizer, they're basically just more words
> that are that much harder to recognize correctly, as they have increased
> entropy (less internal structure) compared to the other, more usual, words
> in the language used.
>
> Tesseract, like any OCR engine, recognizes a trained (CS jargon!)
> *language*. If a *word*, which you and I may call a shorthand or
> otherwise, did not feature in the training set, then the "hidden Markov
> model"-simile in the engine will rank the raw initial pattern recognition
> result a bit or a lot lower, depending on circumstances, and thus you will
> observe lower scores for untrained jargon or regular *wirds* with typos
> in them, such as the "*wirds*" just now. (The engine would like to read
> "words" or "wards", but "wurds" and "wirds" are unlisted and hence count
> as English language errors; so, while it may well recognize the word
> correctly as "wirds", it will surely give it a (slightly?) lowered score.)
>
> Acronyms, for example "YMMV", are, from a Markov chain / machine
> perspective, completely *nuts* as there's no other word in the English
> language dictionary that contains the "mmv" triple-consonant combo. Hence
> any recognizer must be explicitly trained to recognize it, by including it
> in the training dictionary, and by now you'll realize it will require
> additional training rounds due to the weirdness of having "mmv" in there,
> plus the moderately rare "(SOW)Y" starter ((SOW) = start-of-word edge
> marker): "you", "yoghurt", "ypsilon", ... The Y section in your old printed
> dictionary wasn't all that large either, but it's common enough to have
> been picked up during training. The "mmv" will kill it, score-wise, if
> "YMMV" wasn't in the training set. (What OCR system designers do is pass
> such stuff along with severely lowered scores, marking it as
> doubtful/untrustworthy/WTF, which I dramatise as "killing it".)
>
> Ditto for your (German and semi-numerical) shorthands: "3x" for three
> times was hopefully part of the trained language model. I haven't checked,
> so I don't know.
>
> Anyway, if that word score drops too low, tesseract decides not to list
> the word at all in its output. Lots of folks entering this mailing list
> suffer that fundamental issue: lower scores and output *silence* due to
> feeding tesseract "*wirds*" that do not exist in the chosen models'
> training sets, such as product SKUs. The issue is often compounded by other
> score-decreasing circumstances, for nothing is truly easy here.
>
> Modern high-grade recognizers all have implicitly embedded Markov models
> (think: trained dictionaries plus word stems and endings; this thought
> model is off but close enough for initial comprehension), so you cannot
> "switch off / disable" the language dictionary for tesseract v4/v5 models
> like you could/can for the old v3 ones (which obviously do worse in
> general). Consequently, you cannot prevent the engine from "downgrading"
> shorthand and other words unknown at the training phase.
>
> The corollary: this is why medical and legal recognizers for
> speech-to-text and print-to-text are highly specialized and dedicated
> endeavours which come at a steep price. The consequences of an
> *additional* mistake are very expensive, in all regards, not just
> liability, but also ethically and .......
>
>>>
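
To check which words the engine scored low, as discussed in the quoted
message above, one option is tesseract's TSV output, which carries a per-word
"conf" column. Below is a minimal sketch in plain Python that parses such a
TSV and lists the doubtful words. The function name low_confidence_words and
the cutoff of 60 are assumptions of mine for illustration; pick a cutoff that
suits your scans. Note that words tesseract dropped entirely will not show up
here at all.

```python
import csv
import io


def low_confidence_words(tsv_text: str, threshold: float = 60.0):
    """Return (text, conf) pairs for recognized words scoring below threshold.

    tsv_text is the content of a tesseract TSV output file.
    The default threshold is an arbitrary, illustrative value.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    flagged = []
    for row in reader:
        # level 5 rows are words; page/block/paragraph/line rows have conf -1
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        conf = float(row["conf"])  # integer in v4 output, float in v5
        if text and conf < threshold:
            flagged.append((text, conf))
    return flagged
```

Usage, assuming a scan named scan.png (hypothetical file name): run
"tesseract scan.png out tsv", then pass the content of out.tsv to
low_confidence_words() and look at which shorthands end up in the list.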

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAFP60fqrjNegVHP4-D0HJpdFjR4T_cNbN5x6HsNp9xotmiWFrQ%40mail.gmail.com.
