As a rule of thumb, you can usually get good recognition results for
all standard regular fonts of 11-16pt size, whether the image is a
screenshot or a 300 DPI scan. If the font size, resolution, etc.
differ significantly from these numbers, recognition quality becomes
a matter of experimentation.
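
For readers who want to try the numbers, here is a minimal sketch of the
formula quoted further down in this thread (pixels = DPI * pts / 72). The
function name font_pixel_height is made up for illustration, and the result
is only a rough estimate of font pixel height, not a measurement.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def font_pixel_height(dpi: float, pts: float) -> float:
    """Rough pixel height of a font of `pts` points at `dpi` dots per inch.

    Implements pixels = DPI * pts / 72 from the thread. The name and the
    input check are illustrative assumptions, not part of Tesseract.
    """
    if dpi <= 0 or pts <= 0:
        raise ValueError("dpi and pts must be positive")
    return dpi * pts / POINTS_PER_INCH


if __name__ == "__main__":
    # 96 and 300 DPI are the two resolutions discussed in the thread.
    for dpi in (96, 300):
        for pts in (11, 16):
            px = font_pixel_height(dpi, pts)
            print(f"{pts}pt at {dpi} DPI is about {px:.1f} px")
```

At 300 DPI the 11-16pt range works out to roughly 46-67 px, and at 72 DPI
the formula gives exactly one pixel per point. Counting pixels on a test
scan remains the more reliable check.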

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Sat, Aug 20, 2011 at 2:14 PM, Sriranga(78yrsold)
<[email protected]> wrote:
> Dmitri,
> the issue is really too complicated for a lay user to understand.
> For training purposes in tesseract-ocr, what guidance would you, as an
> expert, give to users who generally depend on a scanner and the
> "Print Screen" key of the computer?
> 1) For scanning typed text: (a) what font size should be used in the
> text, and (b) what resolution should be set in the scanner?
> 2) For a screenshot of a typed text file: should the resolution be
> increased from 96 to 300 dpi, with the help of IrfanView, ImageMagick,
> etc., for any image format like tif, png, etc.?
> With regards,
> -sriranga(78yrs)
>
>
> On Sat, Aug 20, 2011 at 1:33 PM, Dmitri Silaev <[email protected]>
> wrote:
>>
>> There are different cases of how the pixel height of a font's
>> character should be calculated. If you're trying to recognize a
>> screenshot, you may deem one pt to be equal to one pixel when typing
>> text in Windows Paint. However, this might not be true for more
>> complex editors like Photoshop. It also depends on the physical size
>> of the screen's pixels and the current video mode resolution. Another
>> case is a scanned image; here the pixel height depends on the scanning
>> resolution. Still another case, where imho trying to relate pixel
>> height to the font's point size makes absolutely no sense (although it
>> is possible via some multi-parameter formulas), is a photographic or
>> video frame image; here the pixel height varies with the camera
>> position and can even vary within a single line of text.
>>
>> All in all, Tesseract does not bother itself with DPIs, pt sizes,
>> etc.; only pixel size is important for recognition. You can use this
>> formula for scanned images to roughly determine font pixel height:
>>
>> pixels = DPI * pts / 72
>>
>> where pixels is the pixel height to be found, DPI is the scanning
>> resolution, and pts is the font size in typographic points.
>>
>> However, the most reliable way is to scan a test page and manually
>> count the pixels.
>>
>> For those willing to understand everything, here are the links:
>> http://en.wikipedia.org/wiki/Dots_per_inch
>> http://en.wikipedia.org/wiki/Point_%28typography%29
>> http://en.wikipedia.org/wiki/X-height
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>>
>>
>>
>> On Sat, Aug 20, 2011 at 7:48 AM, Sriranga(78yrsold)
>> <[email protected]> wrote:
>> > Dmitri,
>> > Thanks for the valuable guidance. I seek some clarification as follows:
>> > (1) "Tesseract, trained with ordinary fonts, proved good with fonts of
>> > 12-64 pixel height": it would be nice if you indicated the equivalent
>> > font size for a pixel height of 12-64. For 10 or 20 pt size of a
>> > regular (ordinary) font, what is the pixel height used in Notepad?
>> > I am neither a programmer nor a developer, so I am seeking guidance
>> > as a user.
>> > BTW, is it possible to count the pixels of any size, say 20 pt
>> > regular, in Paintbrush, which has a grid (graph-like)? I just tested
>> > this in Paintbrush (see the attached screenshot): the letters were
>> > typed using Arial 20, and I counted a height of 20 pixels.
>> >
>> > Thus I presume that a 12-64 pixel height is equivalent to a 12-64
>> > point size of an ordinary font. Kindly confirm.
>> > With warmest regards,
>> > -sriranga(78yrs)
>> >
>> >
>> > On Sat, Aug 20, 2011 at 1:00 AM, Dmitri Silaev <[email protected]>
>> > wrote:
>> >>
>> >> The DPI measure is confusing for Tesseract's OCR, forget about it. The
>> >> big thing is within-image font's x-height, measured in pixels.
>> >> Tesseract, trained with ordinary fonts, proved good with fonts of
>> >> 12-64 pixel height. If you have bigger characters, scale them down. If
>> >> you have a font that's bold, use morphology and erode characters after
>> >> binarization. Experiment. Removing "greyness" won't help as it's not a
>> >> generic way of getting rid of uneven illumination; you need to use
>> >> more sophisticated algorithms. Just using Photoshop won't let you
>> >> achieve much.
>> >>
>> >> Warm regards,
>> >> Dmitri Silaev
>> >> www.CustomOCR.com
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Aug 19, 2011 at 8:18 PM, Andriy Malovanyy <[email protected]>
>> >> wrote:
>> >> > To Zdenko:
>> >> > I think I have version 3.0 installed, so maybe I should reinstall the
>> >> > new version and try it. Thanks for the description of psm. Did you
>> >> > try to recognize the other unedited images which I attached to
>> >> > the first post?
>> >> >
>> >> > To Rob:
>> >> > Initially I had a 640x480 image at 72 dpi with the number occupying
>> >> > almost all of the image. What I did was just open the image in
>> >> > Photoshop, go to the image size menu, change the resolution to 300
>> >> > dpi (the image increased in size), and set the image size back to
>> >> > 640x480. So I ended up with a 640x480 image at 300 dpi resolution.
>> >> >
>> >> > On 19 Aug, 17:56, Robert Komar <[email protected]> wrote:
>> >> >> On Fri, 19 Aug 2011, Andriy Malovanyy wrote:
>> >> >> > To sriranga:
>> >> >> > I tried changing dpi (check the previous post). It doesn't work.
>> >> >>
>> >> >> Did you rescale the image from 72 dpi to 300 dpi, or just change
>> >> >> the tag on the original image to say 300 dpi?  The latter won't
>> >> >> work.
>> >> >> Tesseract seems to be tuned to work best for scans at 300 dpi
>> >> >> (although I've often successfully used 600 dpi).  Scans done at
>> >> >> 72 dpi usually get very poor results from tesseract.
>> >> >>
>> >> >> Cheers,
>> >> >> Rob Komar
>> >> >
>> >> > --
>> >> > You received this message because you are subscribed to the Google
>> >> > Groups "tesseract-ocr" group.
>> >> > To post to this group, send email to [email protected]
>> >> > To unsubscribe from this group, send email to
>> >> > [email protected]
>> >> > For more options, visit this group at
>> >> > http://groups.google.com/group/tesseract-ocr?hl=en
>> >> >
