TIFF should be okay (IIRC it's usually not a lossy compression format).
The advice re image formats is most relevant when you preprocess your
scanned TIFF images: always use a lossless format, e.g. PNG, as
intermediate output format, so when, for example, using imagemagick, do
magick input.tiff -resize WxH image.png
tesseract ........ image.png
instead of
magick input.tiff -resize WxH image.jpg
tesseract ........ image.jpg
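
If you want to script the scale sweep from my earlier mail (quoted below), here's a rough Python sketch. It only builds the magick command lines, always with PNG as the intermediate format. The function names, the scaled_NNNN.png naming and the 20px measured capital height are my own placeholders, so adjust them to your setup.

```python
import shlex


def target_scale(measured_cap_px, target_cap_px=30):
    """Scale factor that brings a measured capital-letter height to ~30px."""
    return target_cap_px / measured_cap_px


def geometric_scales(center=1.0, step=1.05, n_each_side=2):
    """Scale factors center * step**k for k = -n..n, smallest first."""
    return [center * step ** k for k in range(-n_each_side, n_each_side + 1)]


def magick_commands(src, scales):
    """One 'magick ... -resize N%' command per scale, always writing PNG."""
    cmds = []
    for s in scales:
        pct = f"{s * 100:.1f}%"                    # percentage geometry
        out = f"scaled_{round(s * 1000):04d}.png"  # hypothetical naming scheme
        cmds.append(["magick", src, "-resize", pct, out])
    return cmds


if __name__ == "__main__":
    # Assumption: capitals measure about 20px in the original scan.
    center = target_scale(20)
    # Coarse pass: 5% steps. For the fine pass use step=1.02 around the winner.
    for cmd in magick_commands("input.tiff", geometric_scales(center, 1.05, 2)):
        print(shlex.join(cmd))
```

Run tesseract on each resulting PNG and compare the results before narrowing the step down to 1-2%.
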
Cheers,
Ger
On Monday, July 1, 2024 at 1:29:33 PM UTC+2 [email protected] wrote:
> I'll look into the scaling and denoising.
>
> I have no control over the input format. If you mean to take the TIFF
> image I've got and convert it before OCR, please say that.
>
> Yes, the example I gave was not one of the noisy inputs. I've looked
> through the ones I have handy, and none of them seem to be that bad -- I'll
> look up some of poor quality and post those as well.
>
> Thanks.
>
> On Monday, July 1, 2024 at 4:18:31 AM UTC-4 [email protected] wrote:
>
>> Hi,
>>
>> More on this later (I seem to still have issues posting with attachments
>> here, plus running into a few surprises while doing bulk testing, so this
>> is preliminary):
>>
>> 1. Don't use lossy image file formats if you can avoid it, so PNG is
>> better than JPEG. From what I see, if you need lossy due to storage
>> limitations, webp seems to be better than JPEG. This has to do with the
>> type of noise JPEG introduces ("JPEG artifacts").
>>
>> 2. Scale (resize; use imagemagick or another tool to do this in bulk) the
>> input image to approximately 30px capital letter height for each line.
>> That's the ballpark; do try a couple of scales near that measure, e.g.
>> test with a set of scaled images 5% off to see which scale is 'optimal'
>> for you. It can help to then run an additional test set with scales in a
>> 1-2% geometric scale range (i.e. the next scale to try is 102% of the
>> previous, smaller test size).
>>
>> How to check: output both hocr and tsv outputs with character confidence
>> reporting turned on (tesseract hocr output for character confidence is
>> broken, those numbers only show in tsv), then read those files and check
>> both character and word confidence values output by tesseract. Pick the
>> scaling+misc preprocessing that gives you the highest numbers there on
>> average for your test set.
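
A rough sketch of that check for the word-level numbers, in Python. It assumes the usual tesseract TSV columns (level, page_num, ..., conf, text), where word rows have level 5 and non-word rows carry conf -1. The function names are mine, and character-level confidences are not covered here.

```python
import csv
import io


def mean_word_confidence(tsv_text):
    """Average the 'conf' column over non-empty word rows (level 5).

    Returns 0.0 when the TSV contains no words at all.
    """
    # QUOTE_NONE: recognized text may contain stray quote characters.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confs = []
    for row in reader:
        if row.get("level") != "5":
            continue                      # page/block/paragraph/line rows
        if not (row.get("text") or "").strip():
            continue                      # empty "words"
        confs.append(float(row["conf"]))  # integer or float, depending on version
    return sum(confs) / len(confs) if confs else 0.0


def best_setup(scores):
    """Given {setup_name: mean_confidence}, return the best setup name."""
    return max(scores, key=scores.get)
```

Feed it the contents of each .tsv file from your test set, average per preprocessing setup, and let best_setup() pick the winner.
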
>>
>>
>> After that, it depends...
>>
>> BTW: to my eye your image isn't noisy and you mention noise, hence: you
>> got a few rotten ones for us? ;-)
>>
>>
>> Re noise, preprocessing: what I find helps is killing (masking) all noise
>> that is a few pixels away from any character. Particularly when you are
>> processing low dpi / jpeg input. This must be done before feeding it to
>> tesseract as current tesseract does thresholding, etc for detecting the
>> spots where the text (words) are at, but the latest engine (LSTM) is fed
>> the raw input pixels so any useless noise ends up in there and degrades
>> output.
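
To make the masking idea a bit more concrete, here's a minimal pure-Python sketch on an already binarized image (rows of 0/1, 1 = ink). The rule for what counts as a "character" (a connected blob of at least min_area pixels) and the margin value are my assumptions for illustration; tune both on your own scans. On real data you would apply the resulting mask to the grayscale original before handing it to tesseract.

```python
from collections import deque


def mask_far_noise(img, min_area=4, margin=2):
    """Erase ink that lies more than `margin` pixels from any character.

    Assumption: 8-connected ink blobs with >= min_area pixels are
    characters; everything smaller is a noise candidate. Distance is
    measured as Chebyshev (square neighbourhood) distance.
    """
    h = len(img)
    w = len(img[0]) if img else 0
    seen = [[False] * w for _ in range(h)]
    keep = [[False] * w for _ in range(h)]  # character pixels grown by margin
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            # Flood-fill one 8-connected blob.
            blob, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(blob) < min_area:
                continue  # too small to be a character
            # Mark the neighbourhood of every character pixel as "keep".
            for cy, cx in blob:
                for ny in range(max(cy - margin, 0), min(cy + margin + 1, h)):
                    for nx in range(max(cx - margin, 0), min(cx + margin + 1, w)):
                        keep[ny][nx] = True
    return [[1 if img[y][x] and keep[y][x] else 0 for x in range(w)]
            for y in range(h)]
```

Small specks close to a glyph survive (they may be part of a broken character), while isolated specks in the margins and between lines are wiped.
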
>>
>>
>> TLDR:
>>
>> - scale
>> - denoise
>> - enhance contrast (not necessary in your case)
>> - ... other means to make the image more legible, anything goes ...
>> - dictionary, etc. for tesseract or post-processing: I see you've got
>> jargon in there (susp, iss, ...) which are not regular English dictionary
>> words, so it might help to use a custom dict (but I don't have hard data
>> on that one yet myself)
>>
>>
>>
>>
>> On Mon, 1 Jul 2024, 06:21 Ralph Cook, <[email protected]> wrote:
>>
>>> I have an application using Tesseract on documents which are all in
>>> English, one font, everything I want to recognize is in capital letters,
>>> digits, and punctuation.
>>>
>>> The quality of the scans is often poor, and I have no control over that.
>>> It's sometimes about what you would expect with pages that are scanned,
>>> printed, then scanned again; lots of noise, characters not distinct, etc.
>>>
>>> I don't know what the font is, I call it "Old Line Printer". Here's a
>>> sample:
>>>
>>> [image: Sample text anonymized.png]
>>>
>>> I have erased some identifying information and scratched some lines
>>> where it went.
>>>
>>> I am not familiar with OCR technology in general, nor with neural
>>> networks. I've read in the documentation about how to improve the image,
>>> some things about training, some things about how training is likely not
>>> necessary, etc. I'm looking for someone to recommend an overall strategy:
>>> what should I try first, what is the best 2nd plan, is there likely to be a
>>> 3rd, etc. I'm trying not to spend weeks studying the wrong things.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/185590fa-c34f-4775-a8a8-9f2bfd18c09en%40googlegroups.com
>>>
>>