Dear all,
I'm trying to generate a training text (chi_sim) for training Tesseract,
because I have a better dictionary and better unigram/bigram data than the
defaults. I found the following comment in training/language-specific.sh
(line 845):
# Set language-specific values for several global variables, including
#   ${TEXT_CORPUS}
#      holds the text corpus file for the language, used in phase F
#   ${FONTS[@]}
#      holds a sequence of applicable fonts for the language, used in
#      phase F & I. only set if not already set, i.e. from command line
#   ${TRAINING_DATA_ARGUMENTS}
#      non-default arguments to the training_data program used in phase T
#   ${FILTER_ARGUMENTS} -
#      character-code-specific filtering to distinguish between scripts
#      (eg. CJK) used by filter_borbidden_characters in phase F
#   ${WORDLIST2DAWG_ARGUMENTS}
#      specify fixed length dawg generation for non-space-delimited lang
# TODO(dsl): We can refactor these into functions that assign FONTS,
# TEXT_CORPUS, etc. separately.
So I suppose there are programs called training_data (phase T) and
filter_borbidden_characters (sic, phase F) that create the training text
from wordlists and unigram/bigram frequency data.
Where are these scripts? If they are not available, how else can I generate
training text from dictionary/corpus data?
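
To make the question concrete, this is roughly the fallback I have in mind
if no such tool exists: a weighted random walk over my bigram table, written
in Python. The function names and the toy counts are made up for
illustration and have nothing to do with Tesseract's own scripts, and I
don't know whether this resembles how the default training text was
produced.

```python
import random
from collections import defaultdict


def build_successors(bigram_counts):
    """Turn {(w1, w2): count} into {w1: ([next words], [counts])}."""
    table = defaultdict(lambda: ([], []))
    for (first, second), count in bigram_counts.items():
        words, weights = table[first]
        words.append(second)
        weights.append(count)
    return dict(table)


def generate_line(unigram_counts, successors, length, rng):
    """Draw the first word by unigram frequency, then follow bigram counts.

    Falls back to a fresh unigram draw when a word has no known successor.
    """
    vocab = list(unigram_counts)
    freqs = [unigram_counts[w] for w in vocab]
    word = rng.choices(vocab, weights=freqs, k=1)[0]
    out = [word]
    while len(out) < length:
        if word in successors:
            words, weights = successors[word]
            word = rng.choices(words, weights=weights, k=1)[0]
        else:
            word = rng.choices(vocab, weights=freqs, k=1)[0]
        out.append(word)
    # chi_sim is not space-delimited, so join without separators
    return "".join(out)


if __name__ == "__main__":
    # Toy data only; real counts would come from my own corpus files.
    unigrams = {"中国": 5, "人民": 3, "银行": 2}
    bigrams = {("中国", "人民"): 4, ("人民", "银行"): 2, ("中国", "银行"): 1}
    rng = random.Random(0)  # fixed seed so runs are repeatable
    successors = build_successors(bigrams)
    for _ in range(3):
        print(generate_line(unigrams, successors, 6, rng))
```

Each printed line would then go into the training text file, one line per
row. If the official tools do something smarter than this, I would much
rather use them.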
Thanks.