Re: Newbie tesseract training question

Dmitri Silaev Tue, 29 Mar 2011 07:39:57 -0700

Sriranga,

Actually, I was a bit inaccurate in what I said. I should have said the
2nd and 4th rows, not the last row.


I'll explain. If you take a look at the image Robin provided, you can
see that the 1st and 3rd rows share one character set (0123456789),
which differs from the one used in the 2nd and 4th rows (019ABCDEF).
To make box file editing easier, we can run the box file generation
procedure twice, each time with its own whitelist (which in effect
represents the character set). That way you get *almost* correctly
detected characters, but in two passes. Say you have the files "1.box"
and "2.box" resulting from the first and second passes of the box
generation procedure. To compile the final box file, all you need to
do is cut and paste fragments into it in the following order:

- from "1.box", a fragment corresponding to the 1st row of the image,
- from "2.box", a fragment corresponding to the 2nd row of the image,
- from "1.box", a fragment corresponding to the 3rd row of the image,
- from "2.box", a fragment corresponding to the 4th row of the image.
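
The cut-and-paste order above could also be scripted. What follows is
only a minimal Python sketch, not part of Tesseract. It assumes every
box file line reads "symbol left bottom right top [page]" with the y
axis pointing up, that both passes find the same number of
non-overlapping rows, and that odd rows should come from the first pass
and even rows from the second. The names parse_boxes, split_rows,
merge_passes and merge_files are made up for this illustration.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    rest: str  # trailing fields (e.g. a page number), kept verbatim


def parse_boxes(text: str) -> List[Box]:
    """Parse box file text, assuming 'symbol left bottom right top [page]'."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top, " ".join(parts[5:])))
    return boxes


def split_rows(boxes: List[Box]) -> List[List[Box]]:
    """Group boxes into text rows, top row first.

    Assumes the y axis points up and that rows do not overlap vertically.
    """
    rows: List[List[Box]] = []
    # Visit boxes from the top of the page downwards.
    for box in sorted(boxes, key=lambda b: -(b.top + b.bottom)):
        centre = (box.top + box.bottom) / 2
        # A box joins the current row if its centre lies within the
        # vertical extent of that row's first box.
        if rows and rows[-1][0].bottom <= centre <= rows[-1][0].top:
            rows[-1].append(box)
        else:
            rows.append([box])
    for row in rows:
        row.sort(key=lambda b: b.left)  # left-to-right reading order
    return rows


def merge_passes(pass1: str, pass2: str) -> str:
    """Take rows 1, 3, ... from pass 1 and rows 2, 4, ... from pass 2."""
    rows1 = split_rows(parse_boxes(pass1))
    rows2 = split_rows(parse_boxes(pass2))
    if len(rows1) != len(rows2):
        raise ValueError("the two passes found a different number of rows")
    lines = []
    for index, (row1, row2) in enumerate(zip(rows1, rows2)):
        chosen = row1 if index % 2 == 0 else row2
        for b in chosen:
            lines.append(
                f"{b.symbol} {b.left} {b.bottom} {b.right} {b.top} {b.rest}".rstrip()
            )
    return "\n".join(lines) + "\n"


def merge_files(path1: str, path2: str, out_path: str) -> None:
    """Read two box files and write the merged result (paths are examples)."""
    with open(path1, encoding="utf-8") as f1, open(path2, encoding="utf-8") as f2:
        merged = merge_passes(f1.read(), f2.read())
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(merged)
```

For the files above, merge_files("1.box", "2.box", "led.exp0.box")
would write the combined file. The output name is only an example, and
the result still needs proofreading.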

When you need to do a lot of proofreading, this approach can save you
many keystrokes. But you'll still need to check the output and make
corrections, because the font is being "bootstrapped", i.e. it is
still unknown to Tesseract. The new font resembles the fonts Tesseract
already knows (eng.traineddata) only to some degree, so Tesseract has
to work hard to find good matches, and sometimes it just can't.

Hope it's a bit clearer now.

Warm regards,
Dmitri Silaev





On Tue, Mar 29, 2011 at 6:10 PM, Sriranga(78yrsold)
<[email protected]> wrote:
> Dmitri,
> I am very happy with, and appreciate, your efforts for the benefit of
> newbies. In particular, the step-by-step Tesseract training procedure is
> explained very clearly and lucidly.
> However, I could not follow your last paragraph: "Then cut and paste the
> parts corresponding to the last row and other rows into the new file
> say "led.res". In that way you would *almost* need no manual box file
> editing!" Would you kindly explain a little more, with a sample
> "led.res", for the benefit of newbies? Your method could also be used for
> training Indic languages.
>
> With Warmest Regards,
> -sriranga(78yrsold)
>
>
> On Tue, Mar 29, 2011 at 6:39 PM, Dmitri Silaev <[email protected]>
> wrote:
>>
>> Damn, it really does 8-|
>> Sending a zip
>>
>>
>>
>> On Tue, Mar 29, 2011 at 3:27 PM, Robin <[email protected]> wrote:
>> > Apologies - thought it was shared,  should be now.
>> >
>> > Robin
>> >
>> > On Mar 29, 7:54 am, Robin <[email protected]> wrote:
>> >>
>> >> https://docs.google.com/leaf?id=0BztPAgXftYsqZjRjY2ZkZTYtMjI1Yi00NmU2...
>> >>
>> >> Robin
>> >>
>> >> On Mar 29, 7:46 am, zdenko podobny <[email protected]> wrote:
>> >>
>> >>
>> >>
>> >> > Can you provide an example image file (TainingMontage.png)?
>> >>
>> >> > Zdenko
>> >>
>> >> > On Mon, Mar 28, 2011 at 11:12 PM, Robin <[email protected]>
>> >> > wrote:
>> >> > > Hi,
>> >>
>> >> > > I'm reasonably new to tesseract and am trying to train it to recognise
>> >> > > hex characters from a dot matrix LED display.  The characters are
>> >> > > clear and well spaced, but the box file generation always results in
>> >> > > "Empty page".
>> >>
>> >> > > I'm using tesseract 3, installed from tesseract-ocr-setup-3.00.exe.
>> >>
>> >> > > The command line I'm using is...
>> >>
>> >> > > tesseract d:\data\TainingMontage.png d:\data\training\led.exp0
>> >> > > batch.nochop makebox
>> >>
>> >> > > Changing my training image to the eurotext.tif example provided works
>> >> > > as documented in the training notes.
>> >>
>> >> > > I think my trouble lies in the resolution of the individual
>> >> > > characters.  Each character in the display is a 7 high x 5 wide dot
>> >> > > matrix.  I have created a training image with a lot of characters.
>> >>
>> >> > > Any tips?
>> >>
>> >> > > Thanks
>> >>
>> >> > > --
>> >> > > You received this message because you are subscribed to the Google
>> >> > > Groups
>> >> > > "tesseract-ocr" group.
>> >> > > To post to this group, send email to
>> >> > > [email protected].
>> >> > > To unsubscribe from this group, send email to
>> >> > > [email protected].
>> >> > > For more options, visit this group at
>> >> > > http://groups.google.com/group/tesseract-ocr?hl=en.
>> >>
>> >
>>
>

