Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

Nick White Tue, 15 Jul 2014 08:22:58 -0700

Hi again,

On Mon, Jul 14, 2014 at 09:38:26AM -0700, Albrecht Hilker wrote: 
> After some days I came back here and I'm very surprised about your lots of
> posts.
> Thanks for answering and taking the time.


As you may have noticed, there aren't too many people around here 
who are comfortable looking into why things are the way they are - 
I'm very happy to read and learn and take time to answer when people 
have done so!

> I think all the problems that I described can easily be fixed except the
> min/max values.
> 
> And I still don't understand the basic question:
> How can we ever write ONE Unicharset file with font metrics for a whole bunch
> of completely different and contradicting fonts ?
> If there was one unicharset file per font, it would be easier.
> But ONE Unicharset file with min/max values for 358 fonts seems completely
> unsane for me!
> Did you know that the english and the spanish traineddata for 3.02 were
> trained with 358 fonts ?
> https://groups.google.com/forum/?fromgroups#!topic/tesseract-ocr/boQ188SeFsY
> 
> There are fonts that put the "9" below the baseline and other that do not.
> How do we ever write a Unicharset for such different fonts ?
> It simply doesn't make sense to me.

From browsing the code, it looks like the min/max values are 
basically used to score a few things, to determine whether a letter 
seems to be subscript or superscript, to help determine x-height, 
and for table detection.
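
As a rough illustration of the subscript/superscript idea only (this is 
not Tesseract's actual logic; the function name, the tolerance and the 
units are all made up), a check of an observed glyph box against the 
expected vertical ranges for its character could look something like 
this:

```python
def classify_position(bottom, top,
                      min_bottom, max_bottom, min_top, max_top,
                      tolerance=8):
    """Compare an observed glyph box with expected vertical ranges.

    All arguments are in the same (hypothetical) normalised units,
    with larger values meaning higher up on the page.
    Returns "normal", "superscript" or "subscript".
    """
    # Whole box sits clearly above the expected range.
    if bottom > max_bottom + tolerance and top > max_top + tolerance:
        return "superscript"
    # Whole box sits clearly below the expected range.
    if bottom < min_bottom - tolerance and top < min_top - tolerance:
        return "subscript"
    return "normal"
```

The wider the stored ranges are, the less often a check like this can 
say anything useful, which is relevant to the question of sharing one 
set of ranges between many fonts.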

One unicharset file for all fonts is indeed slightly problematic, 
but presumably in general the sorts of shapes and sizes are common 
enough for each character that it's still useful. Frankly part of 
the reason it's done this way is probably historical, from back 
before Tesseract was generally trained with many fonts.
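
To make the "one range for many fonts" point concrete, here is a small 
sketch. It assumes that a single min/max range per character has to 
cover every training font; the font names, the numbers and the 
merge_ranges helper are invented for illustration and are not taken 
from the Tesseract training tools:

```python
def merge_ranges(per_font):
    """Merge per-font (bottom, top) positions of one character.

    per_font maps a font name to the (bottom, top) of the glyph in
    that font. Returns (min_bottom, max_bottom, min_top, max_top),
    i.e. one range that covers all of the fonts.
    """
    bottoms = [bottom for bottom, _ in per_font.values()]
    tops = [top for _, top in per_font.values()]
    return min(bottoms), max(bottoms), min(tops), max(tops)


# Hypothetical measurements for the digit "9" in two kinds of font.
nine = {
    "lining_font": (64, 192),    # "9" sits on the baseline
    "oldstyle_font": (20, 128),  # "9" descends below the baseline
}

print(merge_ranges(nine))  # (20, 64, 128, 192)
```

The merged range ends up much wider than either font's own range, 
which is the problem you describe, but it still rules out positions 
that no font would produce.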

> Why does Tesseract need these min/max values at all ?
> Wouldn't it be much more intelligent to store this information directly in the
> feature data ?
> So each character brings the information about it's baseline, height etc,
> along with the training data ?
> These values could be easier to auto-generate.

Sounds sensible to me.

> And the other thing that I absolutely don't understand:
> You are investigating about this topic now.
> But where are the people who know ?
> Is this only Ray ?

Yes, Ray is basically responsible for everything Tesseract. Other 
people are brought in to do various things, but he is the one 
continuous developer, to my knowledge.

Zdenko does regular fix-ups and improvements, but the bulk of the 
work is done by Ray. And he works by making improvements in a 
private repository, and periodically merging it back to the SVN 
repository. It is not ideal, and certainly a community of interested 
people openly bouncing ideas off one another would be nice, but that 
doesn't happen a lot at the moment. It does a bit on the -dev list.

> Google is one of the richest companies on earth.
> Are they not able to pay one of the persons who knows to write a documentation
> (at least part time) ?

Well Google has the advantage of having Ray, who can just explain 
things to anyone there who wants to understand some part of 
Tesseract. It would be nice for them to fund it more, but they don't 
really *need* to. Google aren't the only profitable company using 
Tesseract in their products, though. It would be nice if another 
company sponsored someone to improve the documentation, or just gave 
their employees enough free time to contribute back once they'd 
figured something non-obvious out. To an extent that's what I do, 
but it's all rather ad-hoc.

> One of the persons who work on the code will require let's say a month to
> write a good documentation about Tesseract, which currently is completely
> abandoned.

Well, I work on the Tesseract documentation, so I'd like to think of 
it as not "completely abandoned" ;) I've been focused on more 
end-user things, partly because they cover the sorts of questions I 
see a lot on this mailing list, and partly because most people don't 
want to think about the code at all.

You'd clearly like more details on how the code works, and how each 
part of the training data is used to generate results. I'd like to 
do more of this, not least because it would improve my understanding 
of the codebase, but ultimately I have limited time and haven't got 
around to it yet. Are there particular things you'd like 
documented, that I could start on?

Nick
