Re: [tesseract-ocr] Suggestions wanted on how to improve recognition

Ger Hobbelt Mon, 01 Jul 2024 13:27:20 -0700

TIFF should be okay (IIRC it's usually not a lossy compression format).  

The advice re image formats is most relevant when you preprocess your 
scanned TIFF images: always use a lossless format, e.g. PNG, as the 
intermediate output format. So when, for example, using imagemagick, do


      magick input.tiff   -resize WxH     image.png
      tesseract ........ image.png

instead of 

      magick input.tiff   -resize WxH     image.jpg
      tesseract ........ image.jpg
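
To make the "always PNG as intermediate" habit hard to get wrong in a bulk run, the magick command line can be generated rather than typed. This is only a sketch: the function name, the "preprocessed" output directory and the file-naming scheme are my own assumptions, and it only builds the argument list (it does not run magick). It uses a percentage for -resize instead of WxH.

```python
from pathlib import PurePosixPath


def resize_command(source, scale_percent, out_dir="preprocessed"):
    """Build (not run) a magick command that always writes a PNG intermediate.

    `out_dir` and the "<stem>_<scale>pct.png" naming are hypothetical choices.
    """
    stem = PurePosixPath(source).stem
    target = PurePosixPath(out_dir) / f"{stem}_{scale_percent:g}pct.png"
    # -resize accepts a percentage geometry such as "150%".
    return ["magick", source, "-resize", f"{scale_percent:g}%", str(target)]


if __name__ == "__main__":
    print(" ".join(resize_command("scan_001.tiff", 150)))
```

The resulting list can be handed to subprocess.run() as-is once the output directory exists.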


Cheers,

Ger



On Monday, July 1, 2024 at 1:29:33 PM UTC+2 [email protected] wrote:

> I'll look into the scaling and denoising. 
>
> I have no control over the input format. If you mean to take the TIFF 
> image I've got and convert it before OCR, please say that.
>
> Yes, the example I gave was not one of the noisy inputs. I've looked 
> through the ones I have handy, and none of them seem to be that bad -- I'll 
> look up some of poor quality and post those as well.
>
> Thanks.
>
> On Monday, July 1, 2024 at 4:18:31 AM UTC-4 [email protected] wrote:
>
>> Hi, 
>>
>> More on this later (I seem to still have issues posting with attachments 
>> here, plus running into a few surprises while doing bulk testing, so this 
>> is preliminary):
>>
>> 1. Don't use lossy image file formats if you can avoid it, so PNG is 
>> better than JPEG. From what I see, if you need lossy due to storage 
>> limitations, it seems webp is better than JPEG. This has to do with the 
>> type of noise jpeg introduces as "jpeg artifacts".
>>
>> 2. Scale (resize; use imagemagick or another tool to do this in bulk) the 
>> input image so that capital letters are approximately 30px high on each 
>> line. That's the ballpark, so do try a couple of scales near that measure, 
>> e.g. compare results for a set of images scaled 5% up and down to see which 
>> scale is 'optimal' for you. It can help to then run an additional test set 
>> with scales in a 1-2% geometric range (i.e. the next scale to try is 102% 
>> of the previous, smaller test size).
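
The scale search described in point 2 could be generated along these lines. It is a rough sketch, not a recipe: the function names are made up, the capital-letter height has to be measured by hand on a sample page, and reading "5% off" as a couple of 5% steps on either side of the base scale is my interpretation.

```python
TARGET_CAP_HEIGHT_PX = 30  # ballpark target from the quoted advice


def base_scale(measured_cap_height_px):
    """Scale factor that brings the measured capital height to the target."""
    return TARGET_CAP_HEIGHT_PX / measured_cap_height_px


def coarse_scales(base, step=0.05, steps_each_side=2):
    """Scales spaced `step` (5%) apart around the base scale."""
    return [round(base * (1 + step * k), 4)
            for k in range(-steps_each_side, steps_each_side + 1)]


def fine_scales(low, high, ratio=1.02):
    """Geometric series: each scale is `ratio` (102%) of the previous one."""
    scales = []
    scale = low
    while scale <= high:
        scales.append(round(scale, 4))
        scale *= ratio
    return scales


if __name__ == "__main__":
    base = base_scale(20)            # capitals measured at 20px in the scan
    print(coarse_scales(base))       # first pass, 5% steps
    print(fine_scales(1.425, 1.575)) # second pass around the best coarse scale
```

Multiply each factor by 100 to get the percentage to feed to the resize step.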
>>
>> How to check: output both hocr and tsv outputs with character confidence 
>> reporting turned on (tesseract hocr output for character confidence is 
>> broken, those numbers only show in tsv), then read those files and check 
>> both character and word confidence values output by tesseract. Pick the 
>> scaling+misc preprocessing that gives you the highest numbers there on 
>> average for your test set.
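
For the "read those files and pick the highest average" step, something like the following could work on the word level. It is a sketch under assumptions: it expects the standard tesseract TSV layout (a header row with "level" and "conf" columns, word rows at level 5, conf of -1 on non-word rows), the function names are invented, and it does not cover character confidences because I am not sure how those are laid out in your build.

```python
import csv
import io
from statistics import mean


def mean_word_confidence(tsv_text):
    """Average the 'conf' column over word rows (level 5) of a tesseract TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confs = []
    for row in reader:
        # Rows above word level carry conf == -1 and are skipped.
        if row["level"] == "5" and float(row["conf"]) >= 0:
            confs.append(float(row["conf"]))
    return mean(confs) if confs else 0.0


def best_variant(tsv_by_variant):
    """Return the name of the preprocessing variant with the highest mean."""
    return max(tsv_by_variant,
               key=lambda name: mean_word_confidence(tsv_by_variant[name]))
```

In practice you would average over all pages of the test set per variant, not a single file, before comparing.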
>>
>>
>> After that, it depends...
>>
>> BTW: to my eye your image isn't noisy and you mention noise, hence: you 
>> got a few rotten ones for us?  ;-)
>>
>>
>> Re noise, preprocessing: what I find helps is killing (masking) all noise 
>> that is a few pixels away from any character, particularly when you are 
>> processing low dpi / jpeg input. This must be done before feeding the image 
>> to tesseract: current tesseract does thresholding, etc. to detect the 
>> spots where the text (words) is, but the latest engine (LSTM) is fed 
>> the raw input pixels, so any useless noise ends up in there and degrades 
>> the output.
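
To illustrate the masking idea on a toy scale: treat large connected blobs of ink as characters, keep everything within a small margin around them, and wipe ink that lies further away. This is a pure-Python sketch on a 0/1 grid, not what I actually run; the size threshold and margin are placeholder values you would have to tune, and a real pipeline would use an image library on the grayscale scan.

```python
from collections import deque


def components(img):
    """Yield each 8-connected blob of ink (1) pixels as a list of (row, col)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            blob, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            yield blob


def mask_far_noise(img, min_char_pixels=4, margin=2):
    """Erase ink further than `margin` pixels from any character-sized blob."""
    h, w = len(img), len(img[0])
    keep = [[False] * w for _ in range(h)]
    for blob in components(img):
        if len(blob) < min_char_pixels:
            continue  # too small to count as a character (assumed threshold)
        for y, x in blob:
            for ny in range(max(0, y - margin), min(h, y + margin + 1)):
                for nx in range(max(0, x - margin), min(w, x + margin + 1)):
                    keep[ny][nx] = True
    return [[1 if img[y][x] and keep[y][x] else 0 for x in range(w)]
            for y in range(h)]
```

Small marks close to a character (punctuation, broken strokes) survive because they fall inside the margin; isolated specks elsewhere are removed.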
>>
>>
>> TLDR:
>>
>> - scale
>> - denoise
>> - enhance contrast (not necessary in your case)
>> - ... other means to make the image more legible, anything goes ...
>> - dictionary, etc. for tesseract or post: I see you've got jargon in 
>> there (susp, iss, ...) which are not regular English dictionary words, so 
>> it might help to use a custom dict (but I don't have hard data on that one 
>> yet myself)
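
On the "or post" side of the dictionary idea, one cheap experiment is to snap OCR tokens to a custom jargon list after recognition. This is a sketch only: the word list, the 0.7 similarity cutoff and the function names are assumptions, it uses Python's difflib rather than anything built into tesseract, and whether it helps on your documents is untested.

```python
import difflib


def correct_token(token, vocabulary, cutoff=0.7):
    """Replace a token with its closest vocabulary entry, if one is close enough."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line, vocabulary):
    """Apply correct_token to every whitespace-separated token in a line."""
    return " ".join(correct_token(tok, vocabulary) for tok in line.split())


if __name__ == "__main__":
    jargon = ["SUSP", "ISS", "ACCT"]  # hypothetical custom word list
    print(correct_line("SU5P ISS 1234", jargon))
```

Tokens with no close match (numbers, names) are left untouched, which keeps the correction conservative.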
>>
>>
>>
>>
>> On Mon, 1 Jul 2024, 06:21 Ralph Cook, <[email protected]> wrote:
>>
>>> I have an application using Tesseract on documents which are all in 
>>> English, one font, everything I want to recognize is in capital letters, 
>>> digits, and punctuation. 
>>>
>>> The quality of the scans is often poor, and I have no control over that. 
>>> It's sometimes about what you would expect with pages that are scanned, 
>>> printed, then scanned again; lots of noise, characters not distinct, etc.
>>>
>>> I don't know what the font is, I call it "Old Line Printer". Here's a 
>>> sample:
>>>
>>> [image: Sample text anonymized.png]
>>>
>>> I have erased some identifying information and scratched some lines 
>>> where it went.
>>>
>>> I am not familiar with OCR technology in general, nor with neural 
>>> networks. I've read in the documentation about how to improve the image, 
>>> some things about training, some things about how training is likely not 
>>> necessary, etc. I'm looking for someone to recommend an overall strategy: 
>>> what should I try first, what is the best 2nd plan, is there likely to be a 
>>> 3rd, etc. I'm trying not to spend weeks studying the wrong things.
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/185590fa-c34f-4775-a8a8-9f2bfd18c09en%40googlegroups.com
>>>
>>

