On Thursday, December 7, 2017 at 9:02:11 PM UTC-5, shree wrote:
> Re smaller traineddata size, it could possibly be related to the word list
> dictionary size.
>
> You can unpack the original traineddata and compare the word list size
> with the one you used.

Thank you for the hint. I ran the following (-u is 'unpack all', I think):

    combine_tessdata -u /usr/local/share/tessdata/eng.traineddata eng.

and I got:

    -rw-r--r-- 1 klein staff 11689099 Dec 7 21:22 eng.lstm
    -rw-r--r-- 1 klein staff     4738 Dec 7 21:22 eng.lstm-number-dawg
    -rw-r--r-- 1 klein staff     4322 Dec 7 21:22 eng.lstm-punc-dawg
    -rw-r--r-- 1 klein staff     1012 Dec 7 21:22 eng.lstm-recoder
    -rw-r--r-- 1 klein staff     6360 Dec 7 21:22 eng.lstm-unicharset
    -rw-r--r-- 1 klein staff  3694794 Dec 7 21:22 eng.lstm-word-dawg
    -rw-r--r-- 1 klein staff       80 Dec 7 21:22 eng.version

The content of eng.version is:

    4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]

Then I unpacked the traineddata I created by adding the characters, and I get:

    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx eng.lstm is missing!
    -rw-r--r-- 1 klein staff     3506 Dec 7 21:26 eng.lstm-number-dawg
    -rw-r--r-- 1 klein staff     4322 Dec 7 21:26 eng.lstm-punc-dawg
    -rw-r--r-- 1 klein staff     1030 Dec 7 21:26 eng.lstm-recoder
    -rw-r--r-- 1 klein staff     9379 Dec 7 21:26 eng.lstm-unicharset
    -rw-r--r-- 1 klein staff  4153402 Dec 7 21:26 eng.lstm-word-dawg
    -rw-r--r-- 1 klein staff       12 Dec 7 21:26 eng.version

Here the content of eng.version is just '4.00.00alpha'.

So you're right that the word list is different. But more importantly, it seems that eng.lstm isn't in the final eng.traineddata at all. Do I not understand something about how the process works? Is this my mistake, or a glitch?

Thanks for helping me to make progress.
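
In case it is useful to anyone repeating this check, here is a minimal sketch of how the comparison of the two unpacked sets could be scripted instead of read off two `ls -l` listings. It assumes each traineddata file has already been unpacked with `combine_tessdata -u` into its own directory using the `eng.` prefix. The function names and the directory layout are my own illustration, not part of Tesseract.

```python
import os


def component_sizes(directory, prefix="eng."):
    """Map each unpacked component suffix (e.g. 'lstm-word-dawg') to its size in bytes."""
    sizes = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.startswith(prefix) and os.path.isfile(path):
            sizes[name[len(prefix):]] = os.path.getsize(path)
    return sizes


def compare_components(original_dir, custom_dir, prefix="eng."):
    """Return {component: (original_size, custom_size)}; None marks an absent file."""
    original = component_sizes(original_dir, prefix)
    custom = component_sizes(custom_dir, prefix)
    return {
        component: (original.get(component), custom.get(component))
        for component in sorted(set(original) | set(custom))
    }


def missing_components(report):
    """List components present in the original set but absent from the custom one."""
    return [
        component
        for component, (orig_size, custom_size) in report.items()
        if orig_size is not None and custom_size is None
    ]
```

With the two listings above, calling `compare_components("original_unpacked", "custom_unpacked")` (both directory names are hypothetical) and passing the result to `missing_components` would be expected to flag `lstm` as the component absent from the custom set, while the report itself shows the size differences for the word list and the other files.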