Please check the last section of
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
which covers combining files, for the correct syntax for building the new traineddata file.

On 08-Dec-2017 8:04 AM, "J Klein" <jetm...@gmail.com> wrote:

> On Thursday, December 7, 2017 at 9:02:11 PM UTC-5, shree wrote:
>>
>> Re smaller traineddata size, it could possibly be related to the word
>> list dictionary size.
>>
>> You can unpack the original traineddata and compare the word list size
>> with the one you used.
>
> Thank you for the hint.
>
> I ran the following (-u is 'unpack all' I think),
>
> combine_tessdata -u /usr/local/share/tessdata/eng.traineddata eng.
>
> and I got:
>
> -rw-r--r-- 1 klein staff 11689099 Dec 7 21:22 eng.lstm
> -rw-r--r-- 1 klein staff     4738 Dec 7 21:22 eng.lstm-number-dawg
> -rw-r--r-- 1 klein staff     4322 Dec 7 21:22 eng.lstm-punc-dawg
> -rw-r--r-- 1 klein staff     1012 Dec 7 21:22 eng.lstm-recoder
> -rw-r--r-- 1 klein staff     6360 Dec 7 21:22 eng.lstm-unicharset
> -rw-r--r-- 1 klein staff  3694794 Dec 7 21:22 eng.lstm-word-dawg
> -rw-r--r-- 1 klein staff       80 Dec 7 21:22 eng.version -- CONTENT is 4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
>
> Now I tried to unpack the one I created by adding the characters, and I get
>
> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx eng.lstm is missing!
> -rw-r--r-- 1 klein staff     3506 Dec 7 21:26 eng.lstm-number-dawg
> -rw-r--r-- 1 klein staff     4322 Dec 7 21:26 eng.lstm-punc-dawg
> -rw-r--r-- 1 klein staff     1030 Dec 7 21:26 eng.lstm-recoder
> -rw-r--r-- 1 klein staff     9379 Dec 7 21:26 eng.lstm-unicharset
> -rw-r--r-- 1 klein staff  4153402 Dec 7 21:26 eng.lstm-word-dawg
> -rw-r--r-- 1 klein staff       12 Dec 7 21:26 eng.version -- CONTENT IS '4.00.00alpha'
>
> So you're right that the word-list is different.
>
> But more importantly it seems that eng.lstm isn't in the final
> eng.traineddata. Do I not understand something about how the process
> works? Is this my mistake, or a glitch?
>
> Thanks for helping me to make progress.
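
If it helps, the comparison of the two unpacked sets can be scripted rather than read off `ls` output. The following is only a rough sketch in Python, not part of Tesseract: the function names `component_sizes` and `compare_components` are made up for illustration, and it assumes both traineddata files have already been unpacked with `combine_tessdata -u` into separate directories using the same prefix (here `eng.`).

```python
from pathlib import Path


def component_sizes(directory, prefix):
    """Map each unpacked component (e.g. 'lstm-word-dawg') to its size in bytes.

    Assumes the files in `directory` were produced by `combine_tessdata -u`
    with `prefix` (such as 'eng.') as the output prefix.
    """
    sizes = {}
    for path in Path(directory).glob(prefix + "*"):
        # Skip the packed .traineddata file itself if it sits alongside.
        if path.is_file() and path.suffix != ".traineddata":
            sizes[path.name[len(prefix):]] = path.stat().st_size
    return sizes


def compare_components(original, retrained):
    """Return (missing, changed) for two component -> size mappings.

    missing: components present in `original` but absent from `retrained`.
    changed: components present in both whose sizes differ, mapped to
             (original_size, retrained_size).
    """
    missing = sorted(set(original) - set(retrained))
    changed = {
        name: (original[name], retrained[name])
        for name in sorted(set(original) & set(retrained))
        if original[name] != retrained[name]
    }
    return missing, changed
```

Fed the sizes quoted above, `compare_components` would list `lstm` as missing and report `lstm-word-dawg` as changed from 3694794 to 4153402 bytes, while `lstm-punc-dawg` (4322 bytes in both) would not appear. A size check like this only flags differences; it says nothing about why a component was left out.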
-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduX8XjPY3kptPsT1wsyFc%3D_JRZ9U%2Bdx9M681SJ3ZfgqMJw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.