Hi Shree,

I am learning how to create new traineddata for languages that Tesseract does not yet support, and I would also like to contribute to Tesseract. I have followed all your posts as well as your projects on GitHub. (Thank you for helping and contributing so much online :)) I have already tried fine-tuning the English model.

Is there any information about why we need these files (Devanagari.unicharset, Latin.unicharset and radical-stroke.txt)? And do we need to use them for a new language such as Chhattisgarhi, or any other language that is not yet available for Tesseract?

Any help will be appreciated.

On Wednesday, 8 April 2020 21:58:37 UTC+5:30, shree wrote:
>
> Why do you want to fine-tune eng to get to hindi traineddata?
>
> You can fine-tune hin.traineddata or script/Devanagari.traineddata.
>
> On Wed, Apr 8, 2020, 21:00 Piyush Chandra wrote:
>
>> When I downloaded Devanagari.unicharset, Latin.unicharset and
>> radical-stroke.txt, it worked. What are these files and why do we need
>> them? Do we need to use them every time we work on a new language, or do
>> we need to create our own?
>>
>> On Wednesday, 8 April 2020 20:42:44 UTC+5:30, Piyush Chandra wrote:
>>>
>>> Hi,
>>>
>>> I am trying to create a hindi traineddata from scratch using
>>> eng.traineddata.
>>>
>>> I used some png and txt files to create box files using lstmbox and
>>> edited those box files to correct the words.
>>>
>>> Then I used lstm.train to create lstmf files, and created a unicharset
>>> file from the box files using unicharset_extractor.
>>>
>>> But now, when I use combine_lang_model to get the starter traineddata
>>> file, I am getting an error. Please help.
>>>
>>> ~/hindiFiles/hindi$ /usr/local/bin/combine_lang_model --input_unicharset
>>> ./langdata/hin/hin.unicharset --script_dir ./langdata --words
>>> ./langdata/hin.wordlist --numbers ./langdata/hin.numbers --puncs
>>> ./langdata/hin.punc --output_dir /home/piyush/hindiFiles/hindi/langdata/
>>> --lang hin
>>> Loaded unicharset of size 39 from file ./langdata/hin/hin.unicharset
>>> Setting unichar properties
>>> Setting script properties
>>> Failed to load script unicharset from:./langdata/Latin.unicharset
>>> Failed to load script unicharset from:./langdata/Devanagari.unicharset
>>> Warning: properties incomplete for index 3 = मे
>>> Warning: properties incomplete for index 4 = रा
>>> Warning: properties incomplete for index 5 = ना
>>> Warning: properties incomplete for index 6 = म
>>> Warning: properties incomplete for index 7 = पी
>>> Warning: properties incomplete for index 8 = यू
>>> Warning: properties incomplete for index 9 = ष
>>> Warning: properties incomplete for index 10 = है
>>> Warning: properties incomplete for index 11 = ।
>>> Warning: properties incomplete for index 12 = हाँ
>>> Warning: properties incomplete for index 13 = ,
>>> Warning: properties incomplete for index 14 = मु
>>> Warning: properties incomplete for index 15 = झे
>>> Warning: properties incomplete for index 16 = भू
>>> Warning: properties incomplete for index 17 = ख
>>> Warning: properties incomplete for index 18 = ल
>>> Warning: properties incomplete for index 19 = गी
>>> Warning: properties incomplete for index 20 = तु
>>> Warning: properties incomplete for index 21 = म्
>>> Warning: properties incomplete for index 22 = हा
>>> Warning: properties incomplete for index 23 = क्
>>> Warning: properties incomplete for index 24 = या
>>> Warning: properties incomplete for index 25 = कै
>>> Warning: properties incomplete for index 26 = से
>>> Warning: properties incomplete for index 27 = हो
>>> Warning: properties incomplete for index 28 = ?
>>> Warning: properties incomplete for index 29 = क
>>> Warning: properties incomplete for index 30 = ब
>>> Warning: properties incomplete for index 31 = त
>>> Warning: properties incomplete for index 32 = आ
>>> Warning: properties incomplete for index 33 = ओ
>>> Warning: properties incomplete for index 34 = गे
>>> Warning: properties incomplete for index 35 = नीं
>>> Warning: properties incomplete for index 36 = द
>>> Warning: properties incomplete for index 37 = र
>>> Warning: properties incomplete for index 38 = ही
>>> Config file is optional, continuing...
>>> Failed to read data from: ./langdata/hin/hin.config
>>> Failed to read data from: ./langdata/radical-stroke.txt
>>> Error reading radical code table ./langdata/radical-stroke.txt
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/77cf0099-a40e-4186-b76c-b844832e2240%40googlegroups.com
>> .
>>

--
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/aadfb8a5-f3b7-4ab1-93c1-d0381d6ab3f3%40googlegroups.com.
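
A note on the log in this thread: combine_lang_model looked in the --script_dir directory (./langdata) for Latin.unicharset, Devanagari.unicharset and radical-stroke.txt, and the log ends with the radical-stroke.txt error. A minimal pre-flight sketch, assuming those three names are the ones needed for this Hindi setup (another script would presumably need its own script unicharset instead of Devanagari.unicharset), could check for them before the tool is run. The function name check_script_dir and the SCRIPT_DIR variable are made up for illustration and are not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical helper, not part of Tesseract: report which of the given
# files are missing or unreadable in a script directory.
# Usage: check_script_dir DIR FILE...
# Returns 0 if every file is readable, 1 otherwise.
check_script_dir() {
  dir="$1"
  shift
  missing=0
  for f in "$@"; do
    if [ ! -r "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

# File names taken from the error messages in the log; the directory
# defaults to the --script_dir value used there (assumption: same layout).
# "|| true" keeps this line from aborting a script that runs with "set -e".
check_script_dir "${SCRIPT_DIR:-./langdata}" \
  Latin.unicharset Devanagari.unicharset radical-stroke.txt || true
```

If the check prints nothing, the three files are in place and the combine_lang_model command from the log can be retried with the same --script_dir.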

