Hi,

I'm trying to create a traineddata file with a specific word list. What I have done so far is:

1.) Create specific files in langdata/eng:
- eng.wordlist (containing my specific words)
- eng.finetune.training_text (representative text containing only chars found in my words)
- eng.numbers and eng.punc (the original English versions, with chars not present in my words removed)

2.) Run tesstrain.sh on a couple of fonts to create a starter eng.traineddata, and run combine_tessdata -u to extract the new dawg files.

3.) Check that eng.charset_size=76.txt contains the expected chars, and run wordlist2dawg -t to verify that the word list matches the word-dawg.

4.) Run combine_tessdata -o [best eng.traineddata] eng.word-dawg eng.punc-dawg eng.number-dawg eng.unicharset (to overwrite the original dawgs in the traineddata with my own).

At the moment I cannot get step 4 to work: the process simply adds my dawgs to the traineddata under shortened names, alongside the original ones. I have tried renaming my files to match those listed by combine_tessdata -d, but it still renames and adds them, as shown here:

<https://lh3.googleusercontent.com/-tU5w_WUZv2w/W03BqH6xMSI/AAAAAAAAACU/BAJSuHF-iR4hHIQXbReCSDoS26ZpkUWzwCLcBGAs/s1600/combine_tessdata_screenshot1.png>

Can anyone suggest what I might be doing wrong, or how best to incorporate my specific dawgs into the best traineddata?

Thanks,
James
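
For reference, steps 2 to 4 of this post can be sketched as a small shell script. This is only a rough reconstruction, not the exact commands that were run: the font names, the `train_out` output directory, the `best/eng.traineddata` path, and the `run` helper are all placeholders introduced for illustration, and the component file names are the ones quoted in the post. By default the script only prints each command (dry run).

```shell
#!/bin/sh
# Sketch of steps 2-4. Prints each command by default (dry run);
# set DRY_RUN=0 to actually execute them.
# All paths and font names below are hypothetical placeholders.

DRY_RUN="${DRY_RUN:-1}"
BEST="${BEST:-best/eng.traineddata}"   # placeholder: the "best" model to modify
OUT="${OUT:-train_out}"                # placeholder: tesstrain.sh output directory

run() {
  # Echo the command, and execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Step 2: build a starter traineddata from langdata/eng, then unpack it.
run tesstrain.sh --lang eng --langdata_dir langdata --tessdata_dir tessdata \
    --fontlist "Arial" "Verdana" --output_dir "$OUT"
run combine_tessdata -u "$OUT/eng.traineddata" "$OUT/eng."

# Step 3: verify that the word list matches the extracted word dawg.
run wordlist2dawg -t langdata/eng/eng.wordlist \
    "$OUT/eng.word-dawg" "$OUT/eng.unicharset"

# Step 4: list the components in the best model, then attempt the overwrite.
# Comparing the -d listing before and after shows whether components
# were replaced or merely added.
run combine_tessdata -d "$BEST"
run combine_tessdata -o "$BEST" "$OUT/eng.word-dawg" "$OUT/eng.punc-dawg" \
    "$OUT/eng.number-dawg" "$OUT/eng.unicharset"
run combine_tessdata -d "$BEST"
```

Running `combine_tessdata -d` on a copy of the best model before and after step 4 makes it easy to see which component names the tool expects and which ones it added.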