Thank you for your detailed answer.

g...@hobbelt.com wrote on Thursday, 15 February 2024 at 18:51:05 UTC+1:

> Re tesseract output for "mittag" etc. in your sample: the first port of 
> call for cleaning up dot matrix printer output before OCR, i.e. dedicated 
> image preprocessing, would be googling
>
> leptonica image morphology, open close expand dilate dot matrix
>
> or some such.
>
> While I would go with leptonica for that, as tesseract already uses the 
> same lib and I'd rather code this in C++ or shell/node, the OpenCV 
> documentation for the same math ops is more intuitive to me. 
> https://docs.opencv.org/4.x/d9/d61/tutorial_py_morphological_ops.html
>
> This is always finicky stuff so getting the parameters just right is an 
> exercise left to the reader today. ;-)
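
A minimal sketch of the "close" operation those search terms point to, written in plain Python so it runs without leptonica or OpenCV. The function names, the 3x3 neighbourhood and the 5x11 test image are my own assumptions for illustration; a real pipeline would use the library routines and a structuring element tuned to the printer's dot pitch.

```python
# Toy binary morphology on a list-of-lists image (1 = ink, 0 = paper).
# Illustration only: real preprocessing would call Leptonica or OpenCV.

def dilate(img, radius=1):
    """Set a pixel to 1 if any in-bounds pixel in its square neighbourhood is 1."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ))
    return out


def erode(img, radius=1):
    """Keep a pixel at 1 only if every in-bounds pixel in its neighbourhood is 1."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ))
    return out


def close(img, radius=1):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(img, radius), radius)


# Hypothetical dot matrix stroke: isolated dots with one-pixel gaps between them.
dotted = [[0] * 11 for _ in range(5)]
for x in (2, 4, 6, 8):
    dotted[2][x] = 1

closed = close(dotted)
print("".join("#" if p else "." for p in dotted[2]))  # ..#.#.#.#..
print("".join("#" if p else "." for p in closed[2]))  # ..#######..
```

The gaps between the dots are bridged while the stroke keeps its original length and thickness, which is the effect one hopes for on dot matrix print. With larger gaps the radius has to grow, and that is where the parameter fiddling starts.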
>
> I do recall dot matrix image woes being mentioned before on this ML, but 
> it's a long while back and a quick search didn't dig up links to those 
> conversations.
>
>
>
> On Thu, 15 Feb 2024, 18:18 Ger Hobbelt, <g...@hobbelt.com> wrote:
>
>>
>>
>> On Thu, 15 Feb 2024, 17:06 Ger Hobbelt, <g...@hobbelt.com> wrote:
>>
>>> Re "X" checkbox:
>>>
>>>
>> More shorthand examples in your "input language":
>>
>> Tabl.  = tablet (pill)
>> tägl   = täglich (German: daily dosage)
>>
>>
>> I mention these extra examples (visible in the scanned images) because I 
>> find that people generally have a hard time wrapping their head around the 
>> word "language" as CS-specific jargon: a "language" is both the structure 
>> and all the "words" (vocabulary) you use. As such, "tägl", "tabl", etc. 
>> are just so many more plain *words* in the language used here. A machine 
>> doesn't know or care about the human cleverness that goes into 
>> constructing shorthand or acronyms. For a recognizer, these are basically 
>> just more words, and ones that are that much harder to recognize correctly 
>> because they have increased entropy (less internal structure) compared to 
>> the other, more usual, words in the language used.
>>
>> Tesseract, like any OCR engine, recognizes a trained (CS jargon!) 
>> *language*. If a *word*, which you and I may call a shorthand or 
>> otherwise, did not feature in the training set, then the "hidden Markov 
>> model"-simile in the engine will rank the raw initial pattern recognition 
>> result a bit or a lot lower, depending on circumstances. You will thus 
>> observe lower scores for untrained jargon and for regular *wirds* with 
>> typos in them, such as the "*wirds*" just now. (The engine would like to 
>> read "words" or "wards"; "wurds" and "wirds" are unlisted and hence count 
>> as English language errors, so even when it correctly recognizes the 
>> glyphs as "wirds", it will surely give that word a (slightly?) lowered 
>> score.)
>>
>> Acronyms, for example "YMMV", are, from a Markov chain / machine 
>> perspective, completely *nuts* as there's no other word in the English 
>> language dictionary that contains the "mmv" triple-consonant combo. Hence 
>> any recognizer must be explicitly trained to recognize it, by including it 
>> in the training dictionary, and by now you'll realize it will require 
>> additional training rounds due to the weirdness of having "mmv" in there, 
>> plus the moderately rare "(SOW)Y" starter ((SOW) = start of word edge 
>> marker): "you", "yoghurt", "ypsilon", ... The Y section in your old printed 
>> dictionary wasn't all that large either, but it's common enough to have 
>> been picked up during training. The "mmv" will kill it, score-wise, if 
>> "YMMV" wasn't in the training set. (What OCR system designers do is pass 
>> such stuff along with severely lowered scores, marking it as 
>> doubtful/untrustworthy/WTF, which I dramatise as "killing it".)
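
To make the Markov-chain picture above concrete, here is a toy character-bigram scorer in Python. It is only an illustration of the simile and not how tesseract's LSTM models work internally. The ten-word training list, the add-one smoothing and the function names are all assumptions of mine.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, tiny "training vocabulary"; a real model sees vastly more text.
TRAINING_WORDS = ["words", "wards", "word", "ward", "world", "work",
                  "you", "your", "yoghurt", "year"]
START, END = "^", "$"      # start-of-word and end-of-word edge markers
NEXT_SYMBOLS = 27          # 26 letters plus the end-of-word marker


def train(words):
    """Count how often each symbol follows each other symbol."""
    counts = defaultdict(Counter)
    for word in words:
        symbols = START + word.lower() + END
        for prev, nxt in zip(symbols, symbols[1:]):
            counts[prev][nxt] += 1
    return counts


def score(word, counts):
    """Average log-probability per transition, with add-one smoothing."""
    symbols = START + word.lower() + END
    pairs = list(zip(symbols, symbols[1:]))
    total = 0.0
    for prev, nxt in pairs:
        context = counts.get(prev, Counter())
        seen = context[nxt]                     # 0 for unseen transitions
        total += math.log((seen + 1) / (sum(context.values()) + NEXT_SYMBOLS))
    return total / len(pairs)


model = train(TRAINING_WORDS)
for candidate in ("words", "wirds", "YMMV"):
    print(f"{candidate:6s} {score(candidate, model):.2f}")
```

In this toy setup "words" scores highest, "wirds" lands lower because "wi" and "ir" were never seen, and "YMMV" comes last because almost every transition in it is unseen. Higher (less negative) means "looks more like the trained language".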
>>
>> Ditto for your (German and semi-numerical) shorthands: "3x" for three 
>> times was hopefully part of the trained language model. I haven't checked, 
>> so I don't know.
>>
>> Anyway, if that word score drops too low, tesseract decides not to list 
>> the word at all in its output. Lots of folks entering this mailing list 
>> suffer that fundamental issue: lower scores and output *silence* due to 
>> feeding tesseract "*wirds*" that do not exist in the chosen models' 
>> training sets, such as product SKUs. The issue is often compounded by other 
>> score-decreasing circumstances, for nothing is truly easy here.
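
For the words that do make it into the output, the scores can at least be inspected. Below is a small Python sketch that reads tesseract's TSV output (as produced with the `tsv` output config) and lists the low-confidence words. The sample rows are made up for illustration and the threshold of 60 is an arbitrary assumption, not a recommended value.

```python
import csv
import io

# Made-up sample in the column layout of tesseract's TSV output.
# Level 5 rows are words; higher-level rows carry conf = -1 and no text.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t10\t10\t200\t12\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t40\t12\t91.0\tmittag\n"
    "5\t1\t1\t1\t1\t2\t60\t10\t20\t12\t41.5\t3x\n"
    "5\t1\t1\t1\t1\t3\t90\t10\t40\t12\t38.2\ttägl\n"
)


def low_confidence_words(tsv_text, threshold=60.0):
    """Return (word, confidence) pairs for word rows scoring below threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    flagged = []
    for row in reader:
        if row["level"] != "5":          # skip page/block/paragraph/line rows
            continue
        text = (row["text"] or "").strip()
        conf = float(row["conf"])
        if text and conf < threshold:
            flagged.append((text, conf))
    return flagged


for word, conf in low_confidence_words(SAMPLE_TSV):
    print(f"{word}: {conf}")
```

Words that were dropped entirely will of course not show up here either. This only helps to spot the ones that barely survived.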
>>
>> Modern high grade recognizers all have implicitly embedded Markov models 
>> (think: trained dictionaries plus word stemmings and ~-endings; this 
>> thought model is off but close enough for initial comprehension) so you 
>> cannot "switch off / disable" the language dictionary for tesseract v4/v5 
>> models like you could/can for the old v3 ones (which obviously do worse in 
>> general) and consequently you cannot prevent the engine from "downgrading" 
>> shorthand and other words unknown at the training phase.
>>
>> The corollary is that medical and legal recognizers for speech to text 
>> and print to text are highly specialized and dedicated endeavours which 
>> come at a steep price, because the consequences of an *additional* 
>> mistake are very expensive in all regards: not just liability, but also 
>> ethically and .......
>>
