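The numbers you are complaining about may come from fonts that use 
"old-style numerals", as Georgia does (see old-style-numerals.png). In such 
fonts the "9" descends below the baseline while the "1" sits on it, which 
would explain the lower min_bottom for "9". But this is only a guess.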
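
To compare the two example lines quoted below field by field, here is a minimal Python sketch. It is an illustration only, not part of Tesseract. The quoted message confirms that the first of the ten comma-separated values is min_bottom; the order of the remaining nine (max_bottom, min_top, max_top, min_width, max_width, min_bearing, max_bearing, min_advance, max_advance) is my reading of the unicharset(5) page and should be treated as an assumption. The names GlyphMetrics and parse_unicharset_line are made up for this example.

```python
from typing import NamedTuple, Tuple


class GlyphMetrics(NamedTuple):
    # Field order is an assumption based on the unicharset(5) page;
    # only min_bottom being first is confirmed by the quoted message.
    min_bottom: int
    max_bottom: int
    min_top: int
    max_top: int
    min_width: int
    max_width: int
    min_bearing: int
    max_bearing: int
    min_advance: int
    max_advance: int


def parse_unicharset_line(line: str) -> Tuple[str, GlyphMetrics]:
    """Split one unicharset entry into its character and glyph metrics."""
    fields = line.split()
    char, metrics_field = fields[0], fields[2]  # fields[1] is the properties value
    values = [int(v) for v in metrics_field.split(",")]
    if len(values) != 10:
        raise ValueError(f"expected 10 glyph metrics, got {len(values)}")
    return char, GlyphMetrics(*values)


# The two example lines from the quoted message.
examples = [
    "1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1",
    "9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9",
]

for line in examples:
    char, m = parse_unicharset_line(line)
    print(f"{char}: bottom {m.min_bottom}-{m.max_bottom}, "
          f"top {m.min_top}-{m.max_top}, "
          f"advance {m.min_advance}-{m.max_advance}")
```

Under that assumed field order, both digits share the same top range, and only the bottom range of "9" reaches much lower. That is what you would expect if some of the training fonts draw "9" with a descender.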

On Friday, 4 July 2014 06:40:51 UTC+2, Albrecht Hilker wrote:
>
> Hello
>
> Generally it is very sad that there is no detailed documentation about 
> Tesseract.
>
> The only documentation about Unicharset file that I could find is this:
>
> https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html
>
> But this is completely insufficient and not understandable.
>
> And unicharset_extractor.exe produces wrong and incomplete files.
> So I have to edit them by hand.
> But how ?
>
> I need a detailed explanation how to enter the values for the several 
> min/max parameters.
>
> The sparse documentation says that 128 is the x-height.
> Does anybody think that with this information one is able to edit a 
> Unicharset file ???
>
> How do I enter the width of a character ?
> How do I enter the minimum bottom and the maximum bottom value ?
>
> And the example given on that page does not make any sense:
>
> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
>
> So this example says that
> the character "1" has a min_bottom value of 59 and
> the character "9" has a min_bottom value of 18.
>
> Weird ? ? ?
> Both numbers are aligned to the baseline!
>
> Wouldn't it be more intelligent to define the min_bottom for "9" with a 
> higher value to distinguish it from a lowercase "g" ??
>
> And what about the other values ?
> bearing, advance ?
> Where do I get them from ?
>
> The most weird thing is that the training data may contain 32 fonts but there 
> is only one Unicharset file!
> If there was one Unicharset file per font I would understand.
>
> But in a monospaced font the advance is equal for an "i" and a "W" while in 
> Arial they are very different.
> How do I create a Unicharset file that must fit such different fonts ?
>
> I need a detailed explanation with images (not only text!!) how to obtain 
> these values.
