Re: [tesseract-ocr] Re: Font Limit = 64 fonts in traineddata, really ??

2014-07-09 Thread Nick White
On Tue, Jul 08, 2014 at 10:49:49PM -0700, shree wrote:
> My information IS dated - I haven't followed the recent changes. Please see
> this thread, almost a year old, which discussed the upcoming changes to
> training:
> 
> https://groups.google.com/forum/#!searchin/tesseract-dev/fonts/tesseract-dev/4lxGjCGLBSw/CH1cZsovPjIJ

This thread only really has information about the new training 
tools; I don't think any major changes in the formats / limits of 
things are planned. Those new training tools do exist in SVN now, 
incidentally; see the training/ and training/langdata directories, 
and if you're curious to see how they can be used, check out the 
Makefile of my training[0].

Albrecht, thanks for digging around like this and finding 
inconsistencies in the documentation. I haven't looked at the font 
limits myself, so will try to dip into the code soon to see if I can 
figure out a more definitive answer. If you get there first, let me 
know and I can update the TrainingTesseract3 page as appropriate.

Nick

0. git clone http://ancientgreekocr.org/grc.git

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/20140709173838.GF9792%40manta.lan.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Re: Font Limit = 64 fonts in traineddata, really ??

2014-07-08 Thread shree
My information IS dated - I haven't followed the recent changes. Please see
this thread, almost a year old, which discussed the upcoming changes to
training:

https://groups.google.com/forum/#!searchin/tesseract-dev/fonts/tesseract-dev/4lxGjCGLBSw/CH1cZsovPjIJ



On Wednesday, July 9, 2014 2:18:39 AM UTC+5:30, Albrecht Hilker wrote:
>
>
> > As far as I understand, the font limitation applies up to tesseract 3.02.
> > Major changes to training are currently in the works in SVN for 3.03
>
> The files I am talking about are downloaded from 
> https://code.google.com/p/tesseract-ocr/downloads/list
>
> They are all declared as version 3.02.
> For example: tesseract-ocr-3.02.eng.tar.gz 
> 
>
>
> > hence you see a large number of fonts for the English traineddata but not
> > for others
>
> This is not correct.
> The Spanish traineddata has the same 358 fonts.
>
>



Re: [tesseract-ocr] Re: Font Limit = 64 fonts in traineddata, really ??

2014-07-08 Thread Albrecht Hilker

> As far as I understand, the font limitation applies up to tesseract 3.02.
> Major changes to training are currently in the works in SVN for 3.03

The files I am talking about are downloaded from 
https://code.google.com/p/tesseract-ocr/downloads/list

They are all declared as version 3.02.
For example: tesseract-ocr-3.02.eng.tar.gz 



> hence you see a large number of fonts for the English traineddata but not
> for others

This is not correct.
The Spanish traineddata has the same 358 fonts.



Re: [tesseract-ocr] Re: Font Limit = 64 fonts in traineddata, really ??

2014-07-08 Thread Shree Devi Kumar
As far as I understand, the font limitation applies up to tesseract 3.02.

Major changes to training are currently in the works in SVN for 3.03 (not
fully released yet, hence you see a large number of fonts for the English
traineddata but not for others). Traineddata for the other languages may be
forthcoming.

Ray/Zdenko/Nick may be able to give an idea of the expected release
timeline.

Shree Devi Kumar

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Tue, Jul 8, 2014 at 5:04 PM, Paul wrote:

> If you have a look at intproto.h, you'll see there is a similar
> limitation, but it's much more complicated. Unfortunately I don't have an
> overview of what is possible yet, but I'm working on it. :) Just use
> normproto.h as a reference.
>
> On Tuesday, July 8, 2014 02:55:37 UTC+2, Albrecht Hilker wrote:
>
>> The manual "Training Tesseract 3" says:
>>
>> > Tesseract needs to know about different shapes of the same character by
>> > having different fonts separated explicitly.
>> > This used to be limited to 32 fonts, but the limit has been raised to 64.
>> > It is set by the constant MAX_NUM_CONFIGS defined in intproto.h.
>> > Note that runtime is heavily dependent on the number of fonts provided,
>> > and training more than 32 will result in a significant slow-down.
>>
>>
>>
>> I analyzed the number of fonts in eng.traineddata and was very
>> surprised to find that 358 fonts have been trained!
>> get_fontinfo_table().size() returns 358!
>>
>>
>> Can anybody explain this contradiction to me?
>>
>>
>>
>>
>> Fonts in eng.traineddata:
>>
>>  AR_PL_UKai_CN,
>>  AR_PL_UKai_Patched,
>>  AR_PL_UKai_TW,
>>  AR_PL_UMing_CN_Light,
>>  AR_PL_UMing_Patched_Light,
>>  AR_PL_UMing_TW_MBE_Light,
>>  Aboriginal_Sans,
>>  Aboriginal_Sans_Bold_Italic,
>>  Aboriginal_Sans_Italic,
>>  Aboriginal_Serif,
>>  Aboriginal_Serif_Bold,
>>  Aboriginal_Serif_Bold_Italic,
>>  Aboriginal_Serif_Italic,
>>  Abyssinica_SIL,
>>  AlArabiya,
>>  AlBattar,
>>  AlHor,
>>  AlManzomah,
>>  AlMohanad,
>>  Andale_Mono,
>>  Ani,
>>  AnjaliOldLipi,
>>  Arab,
>>  Arial,
>>  Arial_Black,
>>  Arial_Bold,
>>  Arial_Bold_Italic,
>>  Arial_Italic,
>>  BPG_Chveulebrivi,
>>  BPG_Chveulebrivi_Bold,
>>  BPG_Courier,
>>  BPG_Courier_Bold,
>>  BPG_Elite,
>>  BPG_Elite_Bold,
>>  BPG_Glaho,
>>  BPG_Glaho_Bold,
>>  BPG_Rioni,
>>  BPG_Rioni_Bold,
>>  BPG_Unicode_Standard,
>>  Baekmuk_Batang,
>>  Baekmuk_Batang_Patched,
>>  Baekmuk_Dotum,
>>  Baekmuk_Gulim,
>>  Baekmuk_Headline,
>>  Bangla,
>>  Bitstream_Vera_Sans,
>>  Bitstream_Vera_Sans_Bold,
>>  Bitstream_Vera_Sans_Bold_Oblique,
>>  Bitstream_Vera_Sans_Mono,
>>  Bitstream_Vera_Sans_Mono_Bold,
>>  Bitstream_Vera_Sans_Mono_Bold_Oblique,
>>  Bitstream_Vera_Sans_Mono_Oblique,
>>  Bitstream_Vera_Sans_Mono_Roman,
>>  Bitstream_Vera_Sans_Oblique,
>>  Bitstream_Vera_Sans_Roman,
>>  Bitstream_Vera_Serif,
>>  Bitstream_Vera_Serif_Bold,
>>  Bitstream_Vera_Serif_Roman,
>>  CaslonishFraxx,
>>  Century_Schoolbook_L,
>>  Century_Schoolbook_L_Bold,
>>  Century_Schoolbook_L_Bold_Italic,
>>  Century_Schoolbook_L_Italic,
>>  Century_Schoolbook_L_Roman,
>>  Chandas,
>>  Cloister_Black_Light,
>>  Comic_Sans_MS,
>>  Comic_Sans_MS_Bold,
>>  Cortoba,
>>  Courier_New,
>>  Courier_New_Bold,
>>  Courier_New_Bold_Italic,
>>  Courier_New_Italic,
>>  DejaVu_Sans,
>>  DejaVu_Sans_Bold,
>>  DejaVu_Sans_Bold_Oblique,
>>  DejaVu_Sans_Condensed,
>>  DejaVu_Sans_Condensed_Bold,
>>  DejaVu_Sans_Condensed_Bold_Oblique,
>>  DejaVu_Sans_Condensed_Oblique,
>>  DejaVu_Sans_Mono,
>>  DejaVu_Sans_Mono_Bold,
>>  DejaVu_Sans_Mono_Bold_Oblique,
>>  DejaVu_Sans_Mono_Oblique,
>>  DejaVu_Sans_Oblique,
>>  DejaVu_Sans_Ultra-Light,
>>  DejaVu_Serif,
>>  DejaVu_Serif_Bold,
>>  DejaVu_Serif_Bold_Italic,
>>  DejaVu_Serif_Bold_Oblique,
>>  DejaVu_Serif_Bold_Semi-Condensed,
>>  DejaVu_Serif_Condensed_Bold,
>>  DejaVu_Serif_Condensed_Bold_Italic,
>>  DejaVu_Serif_Condensed_Italic,
>>  DejaVu_Serif_Italic,
>>  DejaVu_Serif_Oblique,
>>  DejaVu_Serif_Semi-Condensed,
>>  Dimnah,
>>  Dustismo,
>>  Dustismo_Roman,
>>  Dustismo_Roman_Bold,
>>  Dustismo_Roman_Italic,
>>  Dustismo_Roman_Italic_Bold,
>>  Dyuthi,
>>  East_Syriac_Adiabene,
>>  East_Syriac_Ctesiphon,
>>  Electron,
>>  Estrangelo_Antioch,
>>  Estrangelo_Edessa,
>>  Estrangelo_Midyat,
>>  Estrangelo_Nisibin,
>>  Estrangelo_Quenneshrin,
>>  Estrangelo_Talada,
>>  Estrangelo_TurAbdin,
>>  FreeMono,
>>  FreeMono_Bold,
>>  FreeMono_Bold_Italic,
>>  FreeMono_Bold_Oblique,
>>  FreeMono_Italic,
>>  FreeMono_Oblique,
>>  FreeSans,
>>  FreeSans_Bold,
>>  FreeSans_Bold_Oblique,
>>  FreeSans_Oblique,
>>  FreeSerif,
>>  FreeSerif_Bold,
>>  FreeSerif_Bold_Italic,
>>  FreeSerif_Italic,
>>  Furat,
>>  Garuda,
>>  Garuda_Bold,
>>  Garuda_Bold_Oblique,
>>  Garuda_Oblique,
>>  GentiumAlt,
>>  GentiumAlt_Italic,
>>  Georgia,
>>  Georgia_Bold,
>>  Georgia_Bold_Italic,
>>  Georgia_Italic,
>>  Granada,
>>  Graph,
>>  Hani,
>>  Haramain,
>>  Hor,
>>  IPAGothic,
>>  IPAMincho,