Hi,

More on this later (I still seem to have issues posting with attachments here, and I'm running into a few surprises while doing bulk testing, so this is preliminary):

1. Don't use lossy image file formats if you can avoid them, so PNG is better than JPEG. From what I see, if you need lossy compression due to storage limitations, WebP seems to be better than JPEG. This has to do with the type of noise JPEG introduces ("JPEG artifacts").

2. Scale the input image (resize it; use ImageMagick or another tool to do this in bulk) so that the capital letters in each line are approximately 30px high. That's the ballpark; do try a couple of scales near that measure, e.g. test with a set of scaled images about 5% off that size to see which scale is 'optimal' for you. It can help to then run an additional test set with scales in a 1-2% geometric range (i.e. the next scale to try is 102% of the previous, smaller test size).

How to check: produce both hOCR and TSV output with character confidence reporting turned on (tesseract's hOCR output for character confidence is broken; those numbers only show in TSV), then read those files and check both the character and word confidence values reported by tesseract. Pick the scaling + misc. preprocessing that gives you the highest numbers there on average for your test set.

After that, it depends...

BTW: to my eye your image isn't noisy, yet you mention noise, hence: have you got a few rotten ones for us? ;-)

Re noise and preprocessing: what I find helps is killing (masking) all noise that is a few pixels or more away from any character, particularly when you are processing low-DPI / JPEG input. This must be done before feeding the image to tesseract: current tesseract does thresholding etc. to detect the spots where the text (words) is, but the latest engine (LSTM) is fed the raw input pixels, so any useless noise ends up in there and degrades the output.

TL;DR:
- scale
- denoise
- enhance contrast (not necessary in your case)
- ... other means to make the image more legible, anything goes ...
- dictionary, etc. for tesseract or post-processing: I see you've got jargon in there (susp, iss, ...)
which are not regular English dictionary words, so a custom dictionary might help (I don't have hard data on that one yet myself).

On Mon, 1 Jul 2024, 06:21 Ralph Cook, <[email protected]> wrote:

> I have an application using Tesseract on documents which are all in
> English, one font, everything I want to recognize is in capital letters,
> digits, and punctuation.
>
> The quality of the scans is often poor, and I have no control over that.
> It's sometimes about what you would expect with pages that are scanned,
> printed, then scanned again; lots of noise, characters not distinct, etc.
>
> I don't know what the font is, I call it "Old Line Printer". Here's a
> sample:
>
> [image: Sample text anonymized.png]
>
> I have erased some identifying information and scratched some lines where
> it went.
>
> I am not familiar with OCR technology in general, nor with neural
> networks. I've read in the documentation about how to improve the image,
> some things about training, some things about how training is likely not
> necessary, etc. I'm looking for someone to recommend an overall strategy:
> what should I try first, what is the best 2nd plan, is there likely to be a
> 3rd, etc. I'm trying not to spend weeks studying the wrong things.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAFP60foqyTc5aYQtdZR5Kkm%2BV7gpUP8m0MyP%3DkfsJEiwR0Bpyg%40mail.gmail.com.
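
PS: a rough sketch of the scale test from point 2, in case it helps to see it spelled out. This is only an illustration under a few assumptions, not a tested pipeline: it assumes ImageMagick 7 (`magick`; older installs use `convert` instead) and `tesseract` are on your PATH, all function names are made up for this example, and it only averages the word-level `conf` column of the standard TSV output (character-level confidences are not covered here).

```python
import csv
import io
import subprocess
import tempfile
from pathlib import Path


def scale_ladder(center: float, step: float = 1.05, n: int = 2) -> list[float]:
    """Geometric series of scale factors: center * step**k for k = -n..n."""
    return [center * step ** k for k in range(-n, n + 1)]


def mean_word_confidence(tsv_text: str) -> float:
    """Average the `conf` column over word rows (level 5) of tesseract TSV."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    confs = [
        float(row["conf"])
        for row in reader
        # Non-word rows (page, block, line) carry conf = -1 and no text.
        if row.get("level") == "5" and (row.get("text") or "").strip()
    ]
    return sum(confs) / len(confs) if confs else 0.0


def ocr_confidence_at_scale(image: Path, scale: float) -> float:
    """Resize with ImageMagick, OCR with tesseract, return mean word conf."""
    with tempfile.TemporaryDirectory() as tmp:
        scaled = Path(tmp) / "scaled.png"  # PNG, to stay lossless
        subprocess.run(
            ["magick", str(image), "-resize", f"{scale * 100:.1f}%", str(scaled)],
            check=True,
        )
        result = subprocess.run(
            ["tesseract", str(scaled), "stdout", "tsv"],
            check=True, capture_output=True, text=True,
        )
    return mean_word_confidence(result.stdout)


def pick_best_scale(image: Path, cap_height_px: float, target_px: float = 30.0):
    """Try scales around target_px / cap_height_px; return (scale, mean conf)."""
    center = target_px / cap_height_px
    scores = {s: ocr_confidence_at_scale(image, s) for s in scale_ladder(center)}
    return max(scores.items(), key=lambda kv: kv[1])
```

For example, if the capitals in a scan measure about 20px, `pick_best_scale(Path("page.png"), 20)` tries five scales around 150% in 5% steps. For the finer second pass, call `scale_ladder(best, step=1.02)` around the winner, and average over your whole test set rather than trusting a single page.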

