Hi there,

I have trained a new font containing upper-case letters and digits. In the evaluation I found that the most frequent errors were 0->O confusions (not the other way around): a total of 38 zeros were recognized as O. Looking through the training images, I found a few boxes labeled O that were actually zeros. I removed those from the .box file and redid the training.
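For anyone who wants to reproduce this kind of evaluation, here is a minimal Python sketch of how such confusion counts can be tallied. It assumes the ground truth and the recognized text are aligned character by character (no dropped or inserted characters). The function name count_confusions and the sample strings are hypothetical and only for illustration; they are not taken from the actual evaluation.

```python
from collections import Counter


def count_confusions(truth: str, recognized: str) -> Counter:
    """Count (expected, got) pairs at positions where the characters differ.

    Assumes both strings have the same length and are aligned position by
    position, i.e. the OCR engine neither dropped nor inserted characters.
    """
    if len(truth) != len(recognized):
        raise ValueError("strings must be aligned and of equal length")
    return Counter((t, r) for t, r in zip(truth, recognized) if t != r)


# Hypothetical sample data, not the real evaluation set.
confusions = count_confusions("A0B8O0", "AOB8OO")
for (expected, got), n in confusions.most_common():
    print(f"{expected}->{got}: {n}")
```

Keeping the direction of each pair (expected, got) separate is what makes it possible to say that 0->O is frequent while O->0 is not.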
After retraining, I now have 40 0->O confusions and, in addition, 78 8->B confusions! Previously there were only 4 8->B confusions. How can such a small change in one letter have such a big effect on a completely different letter?

I noticed that after removing the few O from the .box file, all the following letters in the corresponding .tr file were slightly different. To rule out this effect, I took the original .tr files and manually removed those O from them. The idea is that the .tr files are used to create the prototypes, so leaving all other samples unchanged should result in the same prototypes for each letter. However, the 8->B confusions are now at 37, which is less than 78 but still much more than 4! The 0->O confusions are now at 34, which is only slightly better than the original 38.

What exactly is going on when the prototypes are being generated? What does the clustering algorithm do?

Some additional information:

There are a total of four images, each with its own .box file. The extraneous O were all in the first image file.

In the normproto file, the number of prototypes for the letter O has gone down from 5 to only 1 (if I interpret the file correctly). For the digit 0 it has gone from 1 to 2 prototypes. For the letter B the number of prototypes has not changed (2), although the number of training samples that have contributed to the prototypes has shifted slightly (13 to 63 versus 10 to 66). For the digit 8 the number of prototypes has increased from 1 to 3. Given that I have simply removed a few O from the .tr file, there have been a lot of changes to the other letters as well.

Thanks in advance for any help, insights, or hints you can provide.

Marcus