[tesseract-ocr] Creating traineddata with specific wordlist

James Q Tue, 17 Jul 2018 03:17:27 -0700

Hi
I'm trying to create a traineddata with a specific word list. What I have 
done so far is:
1.) Create specific files in langdata/eng:

   - eng.wordlist (containing my specific words)
   - eng.finetune.training_text (representative text containing only chars 
   found in my words)
   - eng.numbers and eng.punc (original English versions, with chars not 
   present in my words removed)

2.) Run tesstrain.sh on a couple of fonts to create a starter 
eng.traineddata, and run combine_tessdata -u to extract the new dawg files
3.) Check eng.charset_size=76.txt contains the expected chars and run 
wordlist2dawg -t to verify wordlist matches word-dawg
4.) Run combine_tessdata -o [best eng.traineddata] eng.word-dawg 
eng.punc-dawg eng.number-dawg eng.unicharset (to overwrite the original 
dawgs in the traineddata with my own).
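
For reference, steps 2 to 4 can be sketched roughly as the shell commands 
below. This is only a sketch, not the exact commands used: the paths 
(starter/, best/, extracted/) and the DRY_RUN switch are placeholders assumed 
for illustration, and the component file names are the ones named in the 
steps above. With DRY_RUN=1 (the default here) the commands are printed 
rather than executed.

```shell
#!/bin/sh
# Sketch of steps 2-4. All paths are hypothetical placeholders.
DRY_RUN="${DRY_RUN:-1}"                        # 1 = print commands only (assumed helper switch)
STARTER="${STARTER:-starter/eng.traineddata}"  # starter model produced by tesstrain.sh (assumed path)
BEST="${BEST:-best/eng.traineddata}"           # copy of the best eng.traineddata (assumed path)
WORK="${WORK:-extracted}"                      # directory for the unpacked components (assumed path)

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 2: unpack the starter traineddata into $WORK/eng.<component> files.
run mkdir -p "$WORK"
run combine_tessdata -u "$STARTER" "$WORK/eng."

# Step 3: check that every word in the word list is present in the dawg.
run wordlist2dawg -t langdata/eng/eng.wordlist "$WORK/eng.word-dawg" "$WORK/eng.unicharset"

# List the components currently inside the best traineddata.
run combine_tessdata -d "$BEST"

# Step 4: overwrite components in the best traineddata with the new files.
run combine_tessdata -o "$BEST" \
  "$WORK/eng.word-dawg" "$WORK/eng.punc-dawg" \
  "$WORK/eng.number-dawg" "$WORK/eng.unicharset"
```

Running the script with DRY_RUN=0 executes the commands instead of printing 
them, so it is safest to point BEST at a copy of the model first.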

At the moment I cannot get step 4 to work: the process simply adds my dawgs 
to the traineddata under shortened names, alongside the original ones. I 
have tried renaming my files to match those listed by combine_tessdata -d, but 
it still renames and adds them, as shown below:

<https://lh3.googleusercontent.com/-tU5w_WUZv2w/W03BqH6xMSI/AAAAAAAAACU/BAJSuHF-iR4hHIQXbReCSDoS26ZpkUWzwCLcBGAs/s1600/combine_tessdata_screenshot1.png>

Can anyone suggest what I might be doing wrong, or how best to incorporate 
my specific dawgs into the best traineddata?

Thanks
James


