[tesseract-ocr] How to regenerate the training text

Dingyuan Wang Thu, 15 Jun 2017 06:36:07 -0700

Dear all,

I'm trying to generate a training text (chi_sim) for training tesseract 
because I have a better dictionary and better unigram/bigram data than the 
defaults. I found the following comments in training/language-specific.sh:


(line 845)
# Set language-specific values for several global variables, including
#   ${TEXT_CORPUS}
#      holds the text corpus file for the language, used in phase F
#   ${FONTS[@]}
#      holds a sequence of applicable fonts for the language, used in
#      phase F & I. only set if not already set, i.e. from command line
#   ${TRAINING_DATA_ARGUMENTS}
#      non-default arguments to the training_data program used in phase T
#   ${FILTER_ARGUMENTS} -
#      character-code-specific filtering to distinguish between scripts
#      (eg. CJK) used by filter_borbidden_characters in phase F
#   ${WORDLIST2DAWG_ARGUMENTS}
#      specify fixed length dawg generation for non-space-delimited lang
# TODO(dsl): We can refactor these into functions that assign FONTS,
# TEXT_CORPUS, etc. separately.

So I suppose there are scripts called training_data (phase T) 
and filter_borbidden_characters (sic, phase F) that create the training text 
from some wordlists and unigram/bigram frequency data.

Where are these scripts, or how can I otherwise generate training text from 
dictionary/corpus data?
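
To make the question concrete, here is a minimal sketch of the kind of fallback that seems possible without those scripts: sample words in proportion to their unigram counts and join them into lines. Everything in it is an assumption for illustration, not part of tesseract: the tab-separated "word, count" input format, the function names load_unigrams and make_training_text, and the choice to join words without spaces (since chi_sim is not space-delimited). It ignores the bigram data entirely.

```python
import random


def load_unigrams(lines):
    """Parse 'word<TAB>count' lines into parallel lists of words and weights.

    The input format is an assumption made for this sketch.
    """
    words, weights = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        words.append(word)
        weights.append(int(count))
    return words, weights


def make_training_text(words, weights, num_lines=10, words_per_line=8, seed=0):
    """Build lines of text by sampling words in proportion to their counts.

    Words are joined without spaces, which suits a non-space-delimited
    language such as chi_sim. A fixed seed keeps the output reproducible.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(num_lines):
        picked = rng.choices(words, weights=weights, k=words_per_line)
        out.append("".join(picked))
    return out


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real unigram file.
    sample = ["的\t100", "中国\t40", "训练\t10"]
    words, weights = load_unigrams(sample)
    for line in make_training_text(words, weights, num_lines=3):
        print(line)
```

Text produced this way only reflects word frequencies, not natural word order, so it is probably a poor substitute for whatever the original tools did with a real corpus.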

Thanks.

