Re: tips for improving Tesseract accuracy and speed...

Dmitri Silaev Thu, 07 Apr 2011 00:34:00 -0700

Sriranga,

Sorry for the delay.


I meant that for training you just need to use as many *different*
images as possible, not multiple renamed copies of the same image.
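
A small sketch of how renamed copies could be spotted before training, assuming the copies are byte-for-byte identical. The function name `group_identical_files` is made up for illustration; a copy that was re-saved or re-compressed would not be caught by a plain content hash.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_identical_files(paths):
    """Group file paths whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for path in paths:
        # Hash the file contents, not the name, so renaming changes nothing.
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    # Only groups with more than one member are duplicates of each other.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Any group it returns contributes only one real sample set to the training, however many file names it has.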

Warm regards,
Dmitri Silaev





On Mon, Apr 4, 2011 at 2:56 PM, Sriranga(78yrsold)
<[email protected]> wrote:
> Dmitri,
> I am extremely thankful for the valuable guidance.
> With reference to your last paragraph, I could not follow it clearly and
> am confused. Kindly elaborate a little; a sample of yours (in any language,
> or English) will do. Kindly pardon me for troubling you in the midst of
> your hectic work.
> With Choicest Best Wishes and Good Luck,
> -sriranga(78yrs)
>
> On Mon, Apr 4, 2011 at 11:50 AM, Dmitri Silaev <[email protected]>
> wrote:
>>
>> Dear Sriranga,
>>
>> Sorry for the delay.
>>
>> You can indeed set the DPI manually in an image file using any image
>> editor, but the only thing that matters is the resolution your image
>> got from the scanner. Roughly speaking, the resolution here means the
>> number of pixels per letter. This is controlled by the scanner itself
>> or by the scanning program's settings. By changing the DPI afterwards in
>> an image editor, you just change some of the image's attribute values,
>> not its pixels.
>>
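A rough arithmetic sketch of "pixels per letter", assuming printed text whose size is known in typographic points (72 points per inch). The function name is hypothetical, and the result is the height of the font's full em box, so real letters come out somewhat smaller.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def em_height_px(font_size_pt, scan_dpi):
    """Approximate pixel height of a font's em box when scanned at scan_dpi."""
    return font_size_pt * scan_dpi / POINTS_PER_INCH


# The scanner resolution decides how many pixels a letter gets.
print(em_height_px(12, 300))  # 50.0
print(em_height_px(12, 600))  # 100.0
```

Editing the DPI tag of an existing file changes neither number, because the pixel grid was fixed at scan time.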
>> 300 DPI is more than okay for your needs.
>>
>> Renaming a box/train file and feeding it to Tesseract as another
>> sample is not a solution, as by "sample" we here mean a copy of a
>> character obtained under slightly different conditions in another
>> [scanned] image, or at least at another position in the same image. So
>> get as many images as possible, count the number of character samples
>> within each and thus build your training set.
>>
>> Warm regards,
>> Dmitri Silaev
>>
>>
>>
>>
>>
>> On Sat, Apr 2, 2011 at 1:13 PM, Sriranga(78yrsold)
>> <[email protected]> wrote:
>> > Dear Dmitri,
>> > Awaiting your valuable guidance please.
>> > With warmest regards,
>> > -sriranga(78yrs)
>> >
>> > On Wed, Mar 30, 2011 at 8:29 PM, Sriranga(78yrsold)
>> > <[email protected]> wrote:
>> >>
>> >> Dear Dmitri,
>> >> Is it reasonable to presume that scanned images at 300 x 300 dpi are
>> >> adequate? With the help of Irfanview I can check the dpi as well as
>> >> increase or decrease it.
>> >> Generally, as a standard, I select dpi = 300 and resize to 1200 or 2400
>> >> from 600, which is convenient for editing the box file with the help of
>> >> owler. I hope this will not reduce the accuracy of the output. Sample
>> >> tif attached for approval.
>> >>
>> >> Regarding 20 samples of each char: suppose the image1.tif file
>> >> contains one sample of each char of the alphabet. Can it be used 20
>> >> times by renaming the same image file as image1.tif, image2.tif,
>> >> image3.tif ... image20.tif? If not, kindly provide me with your
>> >> sample, if any.
>> >> With Warmest Regards,
>> >> -sriranga(78yrs)
>> >>
>> >> On Wed, Mar 30, 2011 at 5:12 PM, Dmitri Silaev <[email protected]>
>> >> wrote:
>> >>>
>> >>> Depending on the quality of your source images, I think it'd be
>> >>> reasonable to scale them down so that letters have a height of 40
>> >>> pixels or so. That way Tesseract will just have to do a bit less
>> >>> work: scan fewer pixels and construct shorter glyph outlines.
>> >>>
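A back-of-the-envelope sketch of the downscaling suggested above. The 500 x 117 image size and the 90-pixel figure come from Andres's message; treating 90 pixels as the current character height, and the function name itself, are assumptions made for illustration.

```python
def scale_for_char_height(width, height, char_height, target_char_height=40):
    """Return (scale, new_size, area_ratio) to shrink chars to the target height."""
    # Only ever scale down; images with small chars are left untouched.
    scale = min(1.0, target_char_height / char_height)
    new_size = (round(width * scale), round(height * scale))
    # Pixel count, and so the amount of scanning work, shrinks with the area.
    area_ratio = scale * scale
    return scale, new_size, area_ratio


scale, new_size, area_ratio = scale_for_char_height(500, 117, char_height=90)
print(new_size)  # (222, 52)
```

Here the area drops to about a fifth of the original, so if recognition time really is proportional to area, as reported for the 230 ms case, it would land somewhere around 45 ms. That is an extrapolation, not a measurement.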
>> >>> The accuracy may suffer even for such a considerable char height (90
>> >>> is certainly more than enough) if you have significant discrepancies
>> >>> between training and source images. You should try to pass Tesseract
>> >>> images whose characters are as similar as possible in thickness and
>> >>> orientation. To achieve this, you need to pre-process images to make
>> >>> them look alike with respect to lighting conditions, contrast, blur
>> >>> amount, physical dimensions; rectify perspective distortion, etc. And
>> >>> of course, always use the same binarization procedure with the same
>> >>> parameter set, or at least one giving predictably similar results for
>> >>> a range of your source images. Btw, using Otsu thresholding prior to
>> >>> passing images to Tesseract is useless, as Otsu is the binarization
>> >>> procedure employed by Tesseract itself, unless you do Otsu with your
>> >>> own special parameter set and then pass a 1-bit image.
>> >>>
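For reference, a minimal pure-Python sketch of Otsu thresholding on 8-bit grey values, assuming dark characters on a light background. The `bias` argument is a hypothetical stand-in for the "own special parameter set" mentioned above and is not part of Otsu's method. This is not Tesseract's internal code.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of grey values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(pixels, bias=0):
    """Map pixels to 0 (dark) or 255 (light) using the Otsu threshold plus bias."""
    t = otsu_threshold(pixels) + bias
    return [0 if p <= t else 255 for p in pixels]
```

A positive `bias` marks more pixels as dark, which thickens dark strokes. Whichever value is chosen, the same one should be applied to both training and source images so that the stroke thickness stays comparable.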
>> >>> Next, you should train Tesseract keeping in mind that ideally there
>> >>> should be around 20 samples of each char. You shouldn't strive to
>> >>> train using as many char sizes as possible: regardless of the size,
>> >>> Tesseract scales character "models" up or down to the same internal
>> >>> dimensions. But if your source char sizes differ, that's no problem,
>> >>> they'll do. Provide real (probably pre-processed) images for
>> >>> training, not manually compiled ones.
>> >>>
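A sketch for checking the "around 20 samples of each char" guideline against a set of box files. It assumes each box-file line starts with the character followed by four box coordinates (a trailing page number, if present, is ignored). The function names are made up for illustration.

```python
from collections import Counter


def count_box_samples(box_texts):
    """Count character samples across the text contents of several box files."""
    counts = Counter()
    for text in box_texts:
        for line in text.splitlines():
            fields = line.split()
            # Expect: char left bottom right top [page]
            if len(fields) >= 5:
                counts[fields[0]] += 1
    return counts


def undersampled(counts, alphabet, minimum=20):
    """Return the chars of alphabet that have fewer than minimum samples."""
    return {ch: counts[ch] for ch in alphabet if counts[ch] < minimum}
```

The count only means something when the box files come from different images. Byte-identical copies add to the numbers without adding any new sample.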
>> >>> To further improve the speed and accuracy, process your images char
>> >>> by char, bypassing Tesseract's layout analysis. This approach also
>> >>> lets you use char-position-specific whitelists (letters, digits) for
>> >>> even more speedup and precision.
>> >>>
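A sketch of the char-by-char idea with per-position whitelists, using the plate layout from the original question (3 alphanumeric positions followed by 3 numeric ones). `recognize_char` is a hypothetical callback supplied by the caller. It would wrap whatever single-character OCR call is in use, restricted to the given whitelist, for instance through Tesseract's `tessedit_char_whitelist` variable. Segmenting the plate into character images is assumed to happen elsewhere.

```python
import string

ALPHANUMERIC = string.ascii_uppercase + string.digits
DIGITS = string.digits

# One whitelist per character position on the plate.
PLATE_WHITELISTS = (ALPHANUMERIC,) * 3 + (DIGITS,) * 3


def recognize_plate(char_images, recognize_char, whitelists=PLATE_WHITELISTS):
    """Recognise each segmented char with the whitelist for its position."""
    if len(char_images) != len(whitelists):
        raise ValueError("expected %d characters, got %d"
                         % (len(whitelists), len(char_images)))
    return "".join(
        recognize_char(image, whitelist)
        for image, whitelist in zip(char_images, whitelists)
    )
```

With this split, a digit position can never come back as a letter such as O or I, which removes one class of confusion before any post-processing.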
>> >>> Everything related to Tesseract's dictionary facility is totally
>> >>> irrelevant here. You'd better provide entirely empty files for your
>> >>> "traineddata".
>> >>>
>> >>> HTH
>> >>>
>> >>> Warm regards,
>> >>> Dmitri Silaev
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Mar 29, 2011 at 8:17 AM, Andres <[email protected]> wrote:
>> >>> > ...required.
>> >>> >
>> >>> > Hello people,
>> >>> >
>> >>> > I have been developing a licence plate recognition system for a long
>> >>> > time, and I still have to improve my use of Tesseract to make it
>> >>> > usable.
>> >>> >
>> >>> > My first concern is about speed:
>> >>> > After extracting the licence plate image, I get an image like this:
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://docs.google.com/leaf?id=0BxkuvS_LuBAzNmRkODhkYTUtNjcyYS00Nzg5LWE0ZDItNWM4YjRkYzhjYTFh&hl=en&authkey=CP-6tsgP
>> >>> >
>> >>> > As you may see, there are only 6 characters (tess is recognizing more
>> >>> > because there are some blemishes over there, but I get rid of them
>> >>> > with some postprocessing of the layout of the recognized chars).
>> >>> >
>> >>> > On an Intel i7 720 (good power, but using a single thread) the
>> >>> > tesseract part is taking something like 230 ms. This is too much time
>> >>> > for what I need.
>> >>> >
>> >>> > The image is 500 x 117 pixels. I noted that when I reduce the size of
>> >>> > this image the detection time is reduced in proportion to the image
>> >>> > area, which makes good sense. But the accuracy of the OCR is poor when
>> >>> > the character height is below 90 pixels.
>> >>> >
>> >>> > So, I assume that there is a problem with the way I trained
>> >>> > tesseract.
>> >>> >
>> >>> > Because the characters in the plates are assorted (3 alphanumeric, 3
>> >>> > numeric) I trained it with just a single image with all the letters
>> >>> > in the alphabet. I saw that you suggest large training but I imagine
>> >>> > that that doesn't apply here, where the characters are not organized
>> >>> > in words. Am I correct about this?
>> >>> >
>> >>> > So, for you to see, this is the image with which I trained Tesseract:
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0BxkuvS_LuBAzODc1YjIxNWUtNzIxMS00Yjg3LTljMDctNDkyZGIxZWM4YWVm&hl=en&authkey=CMXwo-AL
>> >>> >
>> >>> > In this image the characters are about 55 pixels high.
>> >>> >
>> >>> > Then, for frequent_word_list and words_list I included a single entry
>> >>> > for each character, I mean, something starting with this:
>> >>> >
>> >>> > A
>> >>> > B
>> >>> > C
>> >>> > D
>> >>> > ...
>> >>> >
>> >>> > Do you see something to be improved in what I did? Should I perhaps
>> >>> > use a training image with more letters, with more combinations? Will
>> >>> > that help somehow?
>> >>> >
>> >>> > Should I include in the same image a copy of the same character set
>> >>> > but at a smaller size? In that way, will I be able to pass Tesseract
>> >>> > smaller images and get more speed without sacrificing detection
>> >>> > quality?
>> >>> >
>> >>> >
>> >>> > On the other hand, I found some strange behavior of Tesseract about
>> >>> > which I would like to know a little more:
>> >>> > In my preprocessing I tried Otsu thresholding
>> >>> > (http://en.wikipedia.org/wiki/Otsu%27s_method) and I visually got much
>> >>> > better results, but surprisingly for Tesseract it was worse. It
>> >>> > decreased the stroke thickness of the chars, and the chars I used to
>> >>> > train Tesseract were bolder. So, does Tesseract match the "boldness"
>> >>> > of the characters? Should I train Tesseract with different levels of
>> >>> > boldness?
>> >>> >
>> >>> > I'm using Tesseract 2.04 for this. Do you think that some of these
>> >>> > issues will improve by using Tess 3.0?
>> >>> >
>> >>> >
>> >>> > Thanks,
>> >>> >
>> >>> > Andres
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > You received this message because you are subscribed to the Google
>> >>> > Groups
>> >>> > "tesseract-ocr" group.
>> >>> > To post to this group, send email to [email protected].
>> >>> > To unsubscribe from this group, send email to
>> >>> > [email protected].
>> >>> > For more options, visit this group at
>> >>> > http://groups.google.com/group/tesseract-ocr?hl=en.
>> >>> >
>> >>>
>> >>>
>> >>
>> >
>> >
>
>

