I have a document with one page per 600 dpi TIFF file.  The original 
document was created 50 years ago, using a typewriter.
I want the OCR character and its position on the page, so I use makebox:

 tesseract page.tiff box/page makebox

I get from this a file, box/page.box, which I can then use to outline the 
location of the boxes in my original file.
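
For reference, here is a minimal Python sketch of reading such a box file. It assumes the usual box-file layout of one glyph per line, "char left bottom right top page", with the origin at the bottom-left corner of the image. The names Box, parse_box_lines and to_image_rect are made up for this illustration and are not part of tesseract.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Box(NamedTuple):
    """One line of a box file (hypothetical helper type)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse box-file lines, skipping blank or malformed ones."""
    boxes = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # Split from the right so the five numeric fields are always found,
        # whatever the glyph field contains.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue
        try:
            left, bottom, right, top, page = (int(p) for p in parts[1:])
        except ValueError:
            continue
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes


def to_image_rect(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Convert to (x0, y0, x1, y1) with the origin at the top-left,
    which is what most image libraries expect when drawing."""
    return (box.left, image_height - box.top,
            box.right, image_height - box.bottom)
```

The y-flip matters when drawing the outlines: without it the rectangles land mirrored vertically on the page.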
The original TIFF file has only two colors, black and white.  In the 
images below, I color the "recognized" bits (those inside a box) blue and 
draw a one-pixel red outline around each box defined by tesseract.  The 
boxes look like this:

[image: Screenshot from 2020-04-02 18-11-07.png]
The first line looks really good, but the second line shows "a" and "f" 
each in their own box, and then another box that includes both of them (but 
not all of the "a").  The same sort of thing happens later with the "h" and 
"i": each is in its own box, and then another box is thrown over the two of 
them.  And then "nd" ends up in one box instead of two separate boxes, even 
though there is a clear gutter between them.
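
To put numbers on this kind of overlap, the following sketch lists every pair of boxes whose interiors intersect. It assumes each box is a (left, bottom, right, top) tuple already read from the box file. The function names are hypothetical, and boxes that merely share an edge are not counted as overlapping.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # (left, bottom, right, top)


def rects_overlap(a: Rect, b: Rect) -> bool:
    """True if the interiors of a and b intersect (abutting is not overlap)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(rects: Sequence[Rect]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, of boxes that overlap."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(rects), 2)
        if rects_overlap(a, b)
    ]
```

This compares every pair, so it is quadratic in the number of boxes; that should be fine for one page at a time.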

In another case, we get much more complex overlapping boxes:

[image: Screenshot from 2020-04-02 18-19-33.png]
The first "f" is not recognized at all, while the second one is split into 
3 boxes -- two abut, but one is a really small box inside the top one.  The 
"r" on the top line is two different boxes, as is the later "o", although 
the thickness of the interior red line for the "o" suggests it is actually 
3 boxes, one being only one or two pixels wide.  The "d" and "e" and "a" on 
the second line are each a bunch of overlapping boxes.

I guess my theory of how to do segmentation of the image would be to create 
a set of non-overlapping boxes and make sure that each black bit (or at 
least a cluster of black bits) is in one and only one box.  Clearly this is 
not what tesseract does since some bits end up in many different boxes, and 
other bits are in no box at all.  I have a table of contents page where 
everything is recognized (not necessarily correctly, but at least it 
identifies the bits), except the column of page numbers, which is just 
ignored completely.  That is probably a different problem, though.

[image: Screenshot from 2020-04-02 18-31-26.png]
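
The segmentation idea described above (every cluster of black bits in one and only one box) can be sketched as a plain connected-component pass over the bitmap. This is not what tesseract does internally; it is only an illustration, assuming the page is available as rows of 0/1 values with 1 meaning black, and using 8-connectivity. The function name component_boxes is hypothetical.

```python
from collections import deque
from typing import List, Sequence, Tuple


def component_boxes(grid: Sequence[Sequence[int]]) -> List[Tuple[int, int, int, int]]:
    """Return one inclusive bounding box (x0, y0, x1, y1) per 8-connected
    cluster of black pixels. Origin is the top-left; y grows downward."""
    height = len(grid)
    width = len(grid[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not grid[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Each black pixel is assigned to exactly one cluster, but the bounding rectangles of two clusters can still overlap when the shapes interlock. A broken typewriter stroke would also come out as several clusters, and touching letters as one.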


The first problem is "How do I get non-overlapping boxes?"


