Hi there,

I have trained a new font containing upper case letters and digits. In the 
evaluation I found that the most frequent errors were 0->O confusions (not 
the other way around). A total of 38 zeros were recognized as O. Looking 
through the training images I found a few samples labeled O that were 
actually zeros. I removed those from the .box file and redid the training.
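
The removal step could be sketched roughly as follows. This is only a 
minimal sketch, assuming the usual box-file layout of one glyph per line 
("symbol left bottom right top [page]"). The function name drop_boxes and 
the idea of selecting entries by their 0-based line index are hypothetical 
and not part of the Tesseract tools.

```python
def drop_boxes(lines, symbol, bad_indices):
    """Return box-file lines without the flagged entries labeled `symbol`.

    lines:       box-file lines, assumed "symbol left bottom right top [page]"
    bad_indices: hypothetical set of 0-based line indices picked by hand
    """
    kept = []
    for i, line in enumerate(lines):
        fields = line.split()
        # Drop a line only if it is flagged AND really carries the label,
        # so a wrong index cannot silently delete a different glyph.
        if i in bad_indices and fields and fields[0] == symbol:
            continue
        kept.append(line)
    return kept


# Made-up sample data for illustration only.
sample = [
    "A 10 10 30 40 0",
    "O 35 10 55 40 0",  # imagine this one is really a zero
    "O 60 10 80 40 0",
    "8 85 10 105 40 0",
]
cleaned = drop_boxes(sample, "O", {1})
print("\n".join(cleaned))
```

Checking the label as well as the index guards against off-by-one mistakes 
when the indices are collected by hand.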

As a result I now have 40 confusions 0->O and in addition 78 confusions 
8->B! Previously, there were only 4 confusions 8->B.
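
For clarity on what "confusion" means here, the counts can be thought of as 
in the sketch below. It assumes the ground truth and the recognized text 
are already aligned character by character (same length, no insertions or 
deletions). The name count_confusions and the sample strings are 
hypothetical, not taken from the actual evaluation.

```python
from collections import Counter


def count_confusions(truth, recognized):
    """Count (expected, recognized) pairs that differ, position by position.

    Assumes both strings are aligned one-to-one; real OCR output may need
    an edit-distance alignment first.
    """
    if len(truth) != len(recognized):
        raise ValueError("strings must have the same length")
    return Counter((t, r) for t, r in zip(truth, recognized) if t != r)


# Made-up strings for illustration only.
confusions = count_confusions("A0B80O", "AOB8OO")
for (expected, got), n in confusions.most_common():
    print(f"{expected}->{got}: {n}")
```

Keeping the pair ordered as (expected, recognized) is what distinguishes 
0->O from O->0.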

How can such a small change in one letter have such a big effect on a 
completely different letter?

I noticed that after removing the few O from the .box file, all the 
following letters in the corresponding .tr file were slightly different. 
To remove this effect I took the original .tr files and manually removed 
those O from them. The idea is that the .tr files are used to create the 
prototypes and that leaving all samples unchanged should result in the same 
prototypes for each letter.

However, the confusions 8->B are now at 37, which is less than 78 but still 
much more than 4! The 0->O confusions are now at 34, which is only slightly 
better than the original 38.

What exactly is going on when the prototypes are being generated? What does 
the clustering algorithm do?

Some additional information: There are a total of four images, each with 
its own .box file. The extraneous O were all in the first image file. In 
the normproto file the number of prototypes for the O has gone down from 5 
to only 1 (if I interpret the file correctly). For the digit 0 it has gone 
from 1 to 2 prototypes. For the letter B the number of prototypes has not 
changed (2) although the number of training samples that have contributed 
to the prototypes has shifted slightly (13 to 63 versus 10 to 66). For the 
digit 8 the number of prototypes has increased from 1 to 3.

Although I have simply removed a few O from the .tr file, there have been 
a lot of changes to the other letters as well.

Thanks in advance for any help, insights, or hints you can provide.

Marcus
