Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

Nick White Tue, 15 Jul 2014 07:55:49 -0700

Hi Albrecht,

On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote: 
> When I download the traineddata files and extract the unicharset file from
> them, I notice that some are extremely different from the ones on SVN in the
> folder training/langdata.
> 
> For example:
> Bengali, Hebrew, Greek, Kannada, Malayalam, Tamil, Telugu, Thai.
> 
> These files differ significantly.
> So for example Greek has a size of 9 kB in the traineddata file
> tesseract-ocr-3.02.ell.tar.gz  and defines 151 characters.
> But Greek.unicharset in the folder training/langdata has a size of 216 kB and
> defines 2820 unichars.


I am guessing, but it looks likely that Ray/Google has some internal 
tools that replace any line in the extracted .unicharset with a 
line from the "pregenerated" one in training/langdata. Ray said in 
an email to the dev list some months back that he was planning to 
update the training files a lot soon, so it will be interesting to 
see what lands there.

> The Greek alphabet does not have many more characters than the Latin alphabet!
> Where do they come from ?

Well, if you include all the different combinations of diacritics 
used in polytonic Greek there are a lot more characters - the first 
350ish characters look like they're taken straight from the relevant 
parts of the Unicode standard.
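
To get a feel for the numbers, here is a rough Python sketch that 
counts the assigned characters in the two Greek blocks of the Unicode 
standard. The block ranges are the standard ones; that the top of 
Greek.unicharset was built from exactly these blocks is only my 
guess, and the counts depend on the Unicode version your Python 
ships with.

```python
import unicodedata

# Block ranges as defined in the Unicode standard.
BLOCKS = {
    "Greek and Coptic": (0x0370, 0x03FF),
    "Greek Extended": (0x1F00, 0x1FFF),  # polytonic letters with diacritics
}


def count_assigned(first, last):
    """Count code points in [first, last] that have a character name."""
    return sum(
        1
        for cp in range(first, last + 1)
        if unicodedata.name(chr(cp), None) is not None
    )


counts = {name: count_assigned(lo, hi) for name, (lo, hi) in BLOCKS.items()}
for name, n in counts.items():
    print(f"{name}: {n} assigned characters")
print(f"Total: {sum(counts.values())}")
```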

If you look slightly further down that file, you'll see loads of special 
symbols, including some Hebrew. If you grep around, you'll see that 
they're similar for quite a few of the unicharset files. I would 
again venture a guess that they're just copied in case the training 
decides to include more special characters in the future. But we'll 
have to see the scripts making use of these files to be sure.
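
If you'd rather not grep by hand, something like the following 
sketch lists the entries two unicharset files have in common. It 
assumes the usual layout (first line is the entry count, every 
later line starts with the unichar followed by space-separated 
properties). The function names are my own, and Hebrew.unicharset 
in the usage comment is just an example file name.

```python
from pathlib import Path


def read_unichars(path):
    """Return the unichar (first field) of every entry in a unicharset file."""
    # Split on "\n" only, so unusual Unicode line separators inside
    # an entry are not treated as line breaks.
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    # Assumed layout: line 1 holds the entry count, entries follow.
    return [line.split(" ", 1)[0] for line in lines[1:] if line.strip()]


def shared_unichars(path_a, path_b):
    """Return the set of unichars that appear in both files."""
    return set(read_unichars(path_a)) & set(read_unichars(path_b))


# Example usage (file names are placeholders):
# common = shared_unichars("Greek.unicharset", "Hebrew.unicharset")
# print(len(common), "entries in common")
```

A large overlap between files for unrelated scripts would be 
consistent with the symbols having been copied in from a common 
source, though it wouldn't prove it.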

> This is another example that shows how important documentation is.
> The poor users of Tesseract are left alone in the dark and there is nobody who
> turns on the light!

Lots of cool training-related stuff from Ray's new work has landed, 
but not everything, so things are particularly difficult at the 
moment. Once more of it makes it into the repository, things should 
get better.

I'll reply to your other email soon.

Nick
