As a rule of thumb, one can usually obtain good recognition results for all standard regular fonts of 11-16 pt size, whether the source is a screenshot or a 300 DPI scanned image. If font size, resolution, etc. differ significantly from these numbers, recognition quality becomes a matter of experimentation.
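To put rough numbers on this, here is a minimal Python sketch of the formula quoted further down in this thread (pixels = DPI * pts / 72). The function names are made up for illustration and are not part of Tesseract. Note that the result is the nominal font size in pixels; the x-height is smaller, so counting pixels on a test scan remains the more reliable check.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def font_pixel_height(dpi: float, pts: float) -> float:
    """Rough nominal pixel height of a font: pixels = DPI * pts / 72."""
    return dpi * pts / POINTS_PER_INCH


def resample_factor(src_dpi: float, dst_dpi: float) -> float:
    """Factor by which the pixel dimensions must be multiplied when an
    image is really rescaled from src_dpi to dst_dpi (hypothetical helper;
    changing only the DPI tag leaves the pixels untouched)."""
    if src_dpi <= 0 or dst_dpi <= 0:
        raise ValueError("DPI values must be positive")
    return dst_dpi / src_dpi


if __name__ == "__main__":
    # Compare the 11-16 pt range at typical screen and scanner resolutions.
    for dpi in (72, 96, 300):
        for pts in (11, 16):
            height = font_pixel_height(dpi, pts)
            print(f"{pts} pt at {dpi} DPI is about {height:.1f} px")
```

For example, font_pixel_height(300, 12) gives 50.0, and going from a 72 dpi image to 300 dpi means resampling the pixels by a factor of about 4.17, not just changing the DPI tag.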
Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Sat, Aug 20, 2011 at 2:14 PM, Sriranga(78yrsold) <[email protected]> wrote:
> Dmitri,
> really the issue is very complex/complicated to understand by layman user.
> For training purpose in tesseract-ocr, what is your expertise valuable
> guidance to be followed by users - who uses generally depends on scanner
> machine and "Print Screen" key of the computer.
> 1) for scanning the typed text - (a) font size in the text should be
> used. (b) resolution to be set in the scanner.
> 2) For Screenshot of the typed text file with help of Irfanview, or
> imagemagic etc. resolution should be increased from 96 to 300 dpi
> for any image format like tif, png etc.
> With regards,
> -sriranga(78yrs)
>
>
> On Sat, Aug 20, 2011 at 1:33 PM, Dmitri Silaev <[email protected]>
> wrote:
>>
>> There are different cases of how pixel height of a font's character
>> should be calculated. If you're trying to recognize a screenshot, you
>> may deem one pt to be equal to one pixel when typing it in Windows
>> Paint. However this might not be true for more complex editors like
>> Photoshop. Also this depends on physical size of screen's pixel and
>> current video mode resolution. Another case is a scanned image, here
>> pixel height depends on scanning resolution. Still another case, where
>> imho trying to relate pixel height to font's point size absolutely
>> lacks sense (however it is possible via some multi-parameter
>> formulas), is a photographic or video frame image; here pixel height
>> varies depending on the camera position and even can vary within a
>> single line of text.
>>
>> All in all, Tesseract does not bother itself with DPIs, pt sizes,
>> etc.; only pixel size is important for recognition.
>> You can use this formula for scanned images to roughly determine font
>> pixel height:
>>
>> pixels = DPI * pts / 72
>>
>> where pixels - pixel height to be found, DPI - scanning resolution,
>> pts - size of font in typographic points
>>
>> However, the most reliable way is to scan a test page and manually count
>> pixels.
>>
>> For those willing to understand everything, here are the links:
>> http://en.wikipedia.org/wiki/Dots_per_inch
>> http://en.wikipedia.org/wiki/Point_%28typography%29
>> http://en.wikipedia.org/wiki/X-height
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>> On Sat, Aug 20, 2011 at 7:48 AM, Sriranga(78yrsold)
>> <[email protected]> wrote:
>> > Dmitri,
>> > Thanks for the valuable guidance. I seek some clarification as follows:
>> > (1) "Tesseract, trained with ordinary fonts, proved good with fonts of
>> > 12-64 pixel height" it would be nice, if indicated equivalent font size
>> > for pixel of 12-64? For 10 or 20 pt size of the regular (ordinary) font
>> > what is the pixel height used in the Notepad?
>> > I am not programmer nor developer - as such I am seeking valuable
>> > guidance as user.
>> > BTW Is it possible to count the pixel of any size say 20 pt of
>> > regular in the paint brush in which it has grid (graph like). Just
>> > now I tested in paintbrush vide screenshot attached. alphabets was typed
>> > using Arial- 20 and counted pixel - it has 20 pixels.
>> >
>> > Thus it is presumed that 12-64 pixel height is equivalent to 12-64 point
>> > size of the ordinary font - kindly confirm.
>> > With warmest regards,
>> > -sriranga(78yrs)
>> >
>> >
>> > On Sat, Aug 20, 2011 at 1:00 AM, Dmitri Silaev <[email protected]>
>> > wrote:
>> >>
>> >> The DPI measure is confusing for Tesseract's OCR, forget about it. The
>> >> big thing is within-image font's x-height, measured in pixels.
>> >> Tesseract, trained with ordinary fonts, proved good with fonts of
>> >> 12-64 pixel height.
>> >> If you have bigger characters, scale them down. If
>> >> you have a font that's bold, use morphology and erode characters after
>> >> binarization. Experiment. Removing "greyness" won't help as it's not a
>> >> generic way of getting rid of uneven illumination; you need to use
>> >> more sophisticated algorithms. Just using Photoshop won't let you
>> >> achieve much.
>> >>
>> >> Warm regards,
>> >> Dmitri Silaev
>> >> www.CustomOCR.com
>> >>
>> >>
>> >> On Fri, Aug 19, 2011 at 8:18 PM, Andriy Malovanyy <[email protected]>
>> >> wrote:
>> >> > To Zdenko:
>> >> > I think I have 3.0 version installed, so maybe I should reinstall the
>> >> > new version and try it. Thanks for the description of psm. Did you
>> >> > try to recognize other unedited images which I attached to
>> >> > the first post??
>> >> >
>> >> > To Rob:
>> >> > Initially I had 640x480 image with 72dpi with number occupying almost
>> >> > all the image. What I did is just opened the image in Photoshop, went
>> >> > to size of image menu, changed the resolution to 300 dpi (image
>> >> > increased in size) and set the image size back to 640x480. So, with
>> >> > that I got 640x480 image with 300dpi resolution.
>> >> >
>> >> > On 19 Aug, 17:56, Robert Komar <[email protected]> wrote:
>> >> >> On Fri, 19 Aug 2011, Andriy Malovanyy wrote:
>> >> >> > To sriranga:
>> >> >> > I tried changing dpi (check the previous post). It doesnt work.
>> >> >>
>> >> >> Did you rescale the image from 72 dpi to 300 dpi, or just change
>> >> >> the tag on the original image to say 300 dpi? The latter won't
>> >> >> work.
>> >> >> Tesseract seems to be tuned to work best for scans at 300 dpi
>> >> >> (although I've often successfully used 600 dpi). Scans done at
>> >> >> 72 dpi usually get very poor results from tesseract.
>> >> >>
>> >> >> Cheers,
>> >> >> Rob Komar
>> >> >
>> >> > --
>> >> > You received this message because you are subscribed to the Google
>> >> > Groups "tesseract-ocr" group.
>> >> > To post to this group, send email to [email protected]
>> >> > To unsubscribe from this group, send email to
>> >> > [email protected]
>> >> > For more options, visit this group at
>> >> > http://groups.google.com/group/tesseract-ocr?hl=en

