Hi Albrecht,

On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote:
> When I download the traineddata files and extract the unicharset file from
> them
> I notice that some are extremely different from the ones on SVN in the folder
> training/langdata.
>
> For example:
> Bengali, Hebrew, Greek, Kannada, Malayam, Tamil, Telugu, Thai.
>
> These files differ significantly.
> So for example Greek has a size of 9 kB in the traineddata file
> tesseract-ocr-3.02.ell.tar.gz and defines 151 characters.
> But Greek.unicharset in the folder training/langdata has a size of 216 kB and
> defines 2820 unichars.
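One way to pin down exactly which entries differ between two such files is a small script along these lines. This is only a rough sketch: it assumes a unicharset is plain UTF-8 text with the entry count on the first line and one entry per line after that, with the unichar as the first space-separated field. The function names and the file name ell.unicharset are made up for illustration.

```python
from pathlib import Path


def parse_unicharset(text):
    """Return (declared_count, unichars) for the text of a unicharset file.

    Assumption: line 1 holds the number of entries, and every later
    non-empty line starts with the unichar followed by its properties.
    """
    lines = text.split("\n")
    declared = int(lines[0].split()[0])
    unichars = [line.split(" ", 1)[0] for line in lines[1:] if line.strip()]
    return declared, unichars


def compare_unicharsets(text_a, text_b):
    """Return (only_in_a, only_in_b) as lists sorted by code point."""
    _, a = parse_unicharset(text_a)
    _, b = parse_unicharset(text_b)
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))


def compare_files(path_a, path_b):
    """Same comparison, reading both files from disk as UTF-8."""
    text_a = Path(path_a).read_text(encoding="utf-8")
    text_b = Path(path_b).read_text(encoding="utf-8")
    return compare_unicharsets(text_a, text_b)
```

Calling compare_files("ell.unicharset", "Greek.unicharset") would then give the entries that appear in only one of the two files, which makes it easier to see what the extra entries in the langdata version are.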
I am guessing, but it looks likely that Ray/Google has some internal tools
that replace any line in the extracted .unicharset with a line from the
"pregenerated" one in training/langdata. Ray said in an email to the dev
list some months back that he was planning to update the training files a
lot soon, so it will be interesting to see what lands there.

> The greek alphabet does not have much more characters than the latin alphabet!
> Where do they come from ?

Well, if you include all the different combinations of diacritics used in
polytonic Greek there are a lot more characters - the first 350ish
characters look like they're taken straight from the relevant parts of the
Unicode standard. If you look slightly further down that file, you'll see
loads of special symbols, including some Hebrew. If you grep around, you'll
see that these symbols are much the same across quite a few of the
unicharset files. I would again venture a guess that they're just copied in
case the training decides to include more special characters in the future.
But we'll have to see the scripts making use of these files to be sure.

> This is another example that shows how important a documenation is.
> The poor users of Tesseract are left alone in the dark and there is nobody who
> turns on the light!

It's particularly difficult at the moment because lots of cool stuff
regarding training has landed from Ray's new work, but not everything. Once
more of it makes it into the repository things should get better.

I'll reply to your other email soon.

Nick