Hi,

More on this later (I seem to still have issues posting with attachments
here, plus running into a few surprises while doing bulk testing, so this
is preliminary):

1. Don't use lossy image file formats if you can avoid it: PNG is better than
JPEG. From what I see, if you need lossy compression due to storage
limitations, WebP seems better than JPEG. This has to do with the type of
noise JPEG introduces ("JPEG artifacts").

2. Scale the input image (resize; use ImageMagick or another tool to do this
in bulk) so that capital letters are approximately 30px high on each line.
That's the ballpark; do try a couple of scales near that measure, e.g. test
a set of scaled images in 5% steps to see which scale is 'optimal' for you.
It can help to then run an additional test set with scales in a 1-2%
geometric scale range (i.e. the next scale to try is 102% of the previous,
smaller test size).
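
To make the scale sweep concrete, here is a rough Python sketch that computes the resize factors and prints one ImageMagick command per factor. The measured capital height (18 px), the file names and the helper name scale_factors are made up for illustration; on ImageMagick 6 the command is `convert` rather than `magick`.

```python
def scale_factors(measured_cap_height_px, target_px=30.0, step=1.05, steps_each_side=2):
    """Return resize factors that bracket the target capital-letter height.

    The middle factor brings the measured height to target_px; its
    neighbours are spaced geometrically (each is `step` times the previous).
    """
    base = target_px / measured_cap_height_px
    return [base * step ** k for k in range(-steps_each_side, steps_each_side + 1)]


# Example: capitals measure 18 px in the original scan (hypothetical value).
for f in scale_factors(18):
    pct = f * 100
    # ImageMagick's -resize accepts a percentage.
    print(f"magick input.png -resize {pct:.1f}% scaled_{pct:.1f}.png")
```

For the finer second pass, call it again with step=1.02 around the best factor from the first pass.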

How to check: produce both hOCR and TSV output with character confidence
reporting turned on (tesseract's hOCR output for character confidence is
broken, those numbers only show in the TSV), then read those files and check
both the character and word confidence values reported by tesseract. Pick
the scaling + misc preprocessing that gives you the highest numbers there on
average for your test set.
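
A minimal sketch of the "read those files" step, assuming the standard TSV layout where word rows have level 5 and the confidence sits in the `conf` column. It only averages word confidences; character-level values are not covered here. The sample TSV text and the function name are made up for illustration.

```python
import csv
import io


def mean_word_confidence(tsv_text):
    """Average the `conf` column over the word rows of a Tesseract TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confs = []
    for row in reader:
        # Level 5 rows are words; page/block/paragraph/line rows carry conf -1.
        if row["level"] == "5" and (row["text"] or "").strip():
            confs.append(float(row["conf"]))
    return sum(confs) / len(confs) if confs else 0.0


# Tiny hand-written sample in the TSV column layout (not real OCR output).
SAMPLE = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t",
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t30\t96.5\tSUSP",
    "5\t1\t1\t1\t1\t2\t110\t92\t40\t30\t90.5\tISS",
])
print(mean_word_confidence(SAMPLE))
```

Run it over the TSV of every scaled variant and keep the variant with the highest average across your test set.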


After that, it depends...

BTW: to my eye your image isn't noisy and you mention noise, hence: you got
a few rotten ones for us?  ;-)


Re noise, preprocessing: what I find helps is killing (masking) all noise
that is more than a few pixels away from any character, particularly when
you are processing low-DPI / JPEG input. This must be done before feeding
the image to tesseract: current tesseract does thresholding etc. to detect
the spots where the text (words) is, but the latest engine (LSTM) is fed
the raw input pixels, so any useless noise ends up in there and degrades
the output.
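
One way to approximate that masking, as a sketch only: take rough text boxes (assumed here to come from a first tesseract pass, i.e. the left/top/width/height columns of the TSV, or from any other detector), pad them by a few pixels and paint everything else white. Padded rectangles are a crude stand-in for "a few pixels away from any character", and the plain list-of-rows image is only for illustration; with real scans you would do the same on a NumPy or Pillow image.

```python
def mask_outside_boxes(pixels, boxes, margin=3, background=255):
    """Return a copy of a grayscale image with every pixel farther than
    `margin` pixels from all boxes set to `background`.

    pixels: list of rows, each a list of 0-255 ints
    boxes:  iterable of (left, top, width, height) tuples
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    keep = [[False] * width for _ in range(height)]
    for left, top, w, h in boxes:
        # Mark the padded box, clipped to the image bounds.
        for y in range(max(0, top - margin), min(height, top + h + margin)):
            for x in range(max(0, left - margin), min(width, left + w + margin)):
                keep[y][x] = True
    return [
        [px if keep[y][x] else background for x, px in enumerate(row)]
        for y, row in enumerate(pixels)
    ]


# 10x10 white image with one dark "character" and one stray speck.
img = [[255] * 10 for _ in range(10)]
img[4][4] = img[4][5] = img[5][4] = img[5][5] = 0   # the character
img[9][0] = 0                                       # noise far from any text
clean = mask_outside_boxes(img, [(4, 4, 2, 2)], margin=1)
```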


TLDR:

- scale
- denoise
- enhance contrast (not necessary in your case)
- ... other means to make image easier legible, anything goes ...
- dictionary, etc. for tesseract or post-processing: I see you've got jargon
in there (susp, iss, ...) which are not regular English dictionary words, so
it might help to use a custom dict, but I don't have hard data on that one
yet myself
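
For the "post" variant of the dictionary idea, a small sketch using Python's standard difflib: snap each OCR token to the closest entry of a custom word list when it is similar enough. The word list, the cutoff of 0.7 and the function name are assumptions for illustration, not tested settings.

```python
import difflib

# Hypothetical jargon list; replace with the abbreviations in your documents.
JARGON = ["SUSP", "ISS"]


def correct_token(token, vocabulary=JARGON, cutoff=0.7):
    """Return the closest vocabulary entry if similar enough, else the token."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


print(correct_token("5USP"))   # close to a jargon entry, gets snapped to it
print(correct_token("TOTAL"))  # nothing similar in the list, left unchanged
```

A lower cutoff fixes more misreads but also rewrites more tokens that were correct, so it is worth checking against a hand-verified sample.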




On Mon, 1 Jul 2024, 06:21 Ralph Cook, <[email protected]> wrote:

> I have an application using Tesseract on documents which are all in
> English, one font, everything I want to recognize is in capital letters,
> digits, and punctuation.
>
> The quality of the scans is often poor, and I have no control over that.
> It's sometimes about what you would expect with pages that are scanned,
> printed, then scanned again; lots of noise, characters not distinct, etc.
>
> I don't know what the font is, I call it "Old Line Printer". Here's a
> sample:
>
> [image: Sample text anonymized.png]
>
> I have erased some identifying information and scratched some lines where
> it went.
>
> I am not familiar with OCR technology in general, nor with neural
> networks. I've read in the documentation about how to improve the image,
> some things about training, some things about how training is likely not
> necessary, etc. I'm looking for someone to recommend an overall strategy:
> what should I try first, what is the best 2nd plan, is there likely to be a
> 3rd, etc. I'm trying not to spend weeks studying the wrong things.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/185590fa-c34f-4775-a8a8-9f2bfd18c09en%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/185590fa-c34f-4775-a8a8-9f2bfd18c09en%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

