The numbers you are complaining about may come from "old-style numerals", such as those in the Georgia font (see old-style-numerals.png), where digits like "9" descend below the baseline. But this is only a guess.
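To make the comparison easier to check, here is a minimal sketch that splits the two example lines from the quoted message into named values. It assumes the glyph-metrics order given on the unicharset(5) page (min_bottom, max_bottom, min_top, max_top, min_width, max_width, min_bearing, max_bearing, min_advance, max_advance); the function name `parse_unicharset_line` is my own and not part of Tesseract.

```python
# Assumed field order, taken from the unicharset(5) man page.
METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top",
    "min_width", "max_width", "min_bearing", "max_bearing",
    "min_advance", "max_advance",
)


def parse_unicharset_line(line):
    """Return (character, metrics dict) for one unicharset entry.

    Hypothetical helper for illustration only; it assumes the third
    whitespace-separated field holds the ten comma-separated metrics.
    """
    fields = line.split()
    char, metrics_field = fields[0], fields[2]
    values = [int(v) for v in metrics_field.split(",")]
    if len(values) != len(METRIC_NAMES):
        raise ValueError("expected %d metrics, got %d"
                         % (len(METRIC_NAMES), len(values)))
    return char, dict(zip(METRIC_NAMES, values))


# The two example entries from the documentation page.
EXAMPLES = [
    "1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1",
    "9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9",
]

for entry in EXAMPLES:
    char, metrics = parse_unicharset_line(entry)
    print(char, metrics["min_bottom"], metrics["max_bottom"])
```

If that field order is right, "9" has a min_bottom of 18 against 59 for "1", which is what you would expect if some of the training fonts draw the nine with a descender.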
On Friday, 4 July 2014 06:40:51 UTC+2, Albrecht Hilker wrote:
>
> Hello
>
> Generally it is very sad that there is no detailed documentation about
> Tesseract.
>
> The only documentation about the Unicharset file that I could find is this:
>
> https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html
>
> But this is completely insufficient and not understandable.
>
> And unicharset_extractor.exe produces wrong and incomplete files.
> So I have to edit them by hand.
> But how?
>
> I need a detailed explanation of how to enter the values for the several
> min/max parameters.
>
> The sparse documentation says that 128 is the x-height.
> Does anybody think that with this information one is able to edit a
> Unicharset file???
>
> How do I enter the width of a character?
> How do I enter the minimum bottom and the maximum bottom value?
>
> And the example given on that page does not make any sense:
>
> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
>
> So this example says that
> the character "1" has a min_bottom value of 59 and
> the character "9" has a min_bottom value of 18.
>
> Weird???
> Both numbers are aligned to the baseline!
>
> Wouldn't it be more intelligent to define the min_bottom for "9" with a
> higher value to distinguish it from a lowercase "g"?
>
> And what about the other values?
> Bearing, advance?
> Where do I get them from?
>
> The weirdest thing is that the training data may contain 32 fonts but there
> is only one Unicharset file!
> If there were one Unicharset file per font I would understand.
>
> But in a monospaced font the advance is equal for an "i" and a "W", while
> in Arial they are very different.
> How do I create a Unicharset file that must fit such different fonts?
>
> I need a detailed explanation with images (not only text!!) of how to obtain
> these values.