Problem #1: as long as the components don't touch and the boxes don't
overlap, the bounding boxes don't have to be accurate. If I remember
correctly, though, you can't currently use two boxes to split joined
characters. You could, however, paint a white strip in the image between
the boxes to break the characters apart.
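
A rough sketch of the white-strip idea, in case it helps. It assumes the
image is already loaded as a list of rows of 8-bit grayscale values (row 0
at the top, 255 = white); the function name and parameters are made up for
illustration and are not part of Tesseract. Box files measure y from the
bottom of the image, so box coordinates would need flipping before use here.

```python
WHITE = 255  # assumption: 8-bit grayscale where 255 is white


def paint_white_strip(pixels, x, top, bottom, width=1):
    """Blank `width` columns starting at column x, for rows top..bottom-1.

    `pixels` is a list of rows (row 0 at the top of the image), each row a
    list of grayscale values. The image is modified in place and returned.
    """
    for row in pixels[top:bottom]:
        # Clamp at the right edge so a wide strip never runs off the row.
        for col in range(x, min(x + width, len(row))):
            row[col] = WHITE
    return pixels


# Two touching "characters": one solid dark bar, 3 rows by 6 columns.
image = [[0] * 6 for _ in range(3)]
paint_white_strip(image, x=3, top=0, bottom=3)
```

After the call, column 3 is white in every row, so the bar becomes two
separate components.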
Problem #2: you can delete as many boxes from the box file as you like.
Unboxed components in the image are harmless. The only caveat is to make
sure the tr files are passed to mftraining in the same order as the
corresponding files are passed to unicharset_extractor.
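
Deleting the noise boxes could be scripted along these lines. This is only a
sketch: it assumes each box-file line reads "symbol left bottom right top",
and the area threshold and function names are hypothetical, chosen just for
the example.

```python
def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top)."""
    parts = line.split()
    symbol = parts[0]
    left, bottom, right, top = (int(v) for v in parts[1:5])
    return symbol, left, bottom, right, top


def drop_small_boxes(lines, min_area=9):
    """Keep only the lines whose box covers at least `min_area` pixels.

    `min_area` is an arbitrary illustrative threshold for "noise".
    Blank lines are skipped.
    """
    kept = []
    for line in lines:
        if not line.strip():
            continue
        _, left, bottom, right, top = parse_box_line(line)
        if (right - left) * (top - bottom) >= min_area:
            kept.append(line)
    return kept


box_lines = ["a 10 10 30 40", "~ 50 12 52 14", "b 60 10 80 40"]
cleaned = drop_small_boxes(box_lines)
```

Here the 2x2 speck labelled "~" is dropped and the two real character boxes
are kept in their original order.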

Ray.

On Tue, Nov 4, 2008 at 3:26 AM, Ruwan Janapriya <[EMAIL PROTECTED]> wrote:

> Dear All,
>
> I am curious about the following. It would be a great help if someone
> could answer these questions.
>
> Let's say that I have created a box file from a TIFF image. Ideally the
> box file should contain the bounding box of each character. But as we all
> know, a scanned image can cause many problems.
>
> *Problem #1*
> We can have a box covering two (or more) characters instead of one
> character. As far as I know there are two options. The first option is to
> treat this as a single character and insert the two (or more) corresponding
> Unicode characters under that box. The second option is to split the box in
> the way the "training" wiki suggests [1].
>
> Now my question is: what if we modify the coordinates of the boxes as we
> wish, say by enlarging or shrinking them a bit (without overlapping other
> boxes)?
>
> *Problem #2*
> We can have boxes covering only *non-characters* (e.g. dark patches, noise
> etc.).
>
> Now my question is, what if we delete these boxes and proceed? What is the
> impact? Can't we tell Tesseract that these components are just
> "non-characters"?
>
> [1] Let's say the diagonal corners of the box are [(TLx, TLy), (BRx,
> BRy)], where TL is Top Left and BR is Bottom Right.
> After splitting, the following boxes result: [(TLx, TLy), (TLx / 2 +
> BRx / 2, BRy)] and [(TLx / 2 + BRx / 2, TLy), (BRx, BRy)]
>
> P.S. I wrote JTesseract - a front end for Tesseract training process.
> Answers to these questions would greatly improve that application.
>
> regards,
>
> --
> *Ruwan Janapriya *
> http://www.janapriya.net
