lang.cube.word-freq file

RNB Mon, 12 Sep 2011 18:43:41 -0700

Hello,
I was playing with Tesseract and learned that it can use a dictionary
to improve recognition. I found no dictionary files when I extracted
eng.traineddata, but I did find an 'eng.cube.word-freq' file
containing an exhaustive list of English words, and a couple of
similar files for other languages, under the /tessdata/ folder. I
searched the group for related discussion but was unable to find any
(maybe I overlooked it).


I also noticed that my results were the same with and without the
cube.word-freq file. Although I have not trained Tesseract for a new
language or font, it recognizes most of my input text correctly with
the stock eng.traineddata. What I am working on are the remaining
minor aberrations and incorrect recognitions.

I have a couple of questions on this.

1. What is the purpose of the [lang].cube.word-freq file, given that
it is not part of eng.traineddata? Is it left over from an earlier
version of Tesseract?
2. The eng.cube.word-freq file contains 2 columns. What does the
second column signify? Is it the frequency of the word or a weight
associated with it?
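
For anyone who wants to look at the second column directly, here is a
minimal Python sketch. It assumes each line holds a word followed by
whitespace and a number; that layout is an assumption based on the
two columns described above, not a documented format. The function
names are hypothetical, and the summary does not settle whether the
values are frequencies or weights.

```python
from typing import Dict, Iterable, List, Tuple


def parse_word_freq(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse lines ASSUMED to be '<word><whitespace><number>'."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split off the last whitespace-separated field as the second column.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # no second column on this line
        word, value = parts
        try:
            entries.append((word, float(value)))
        except ValueError:
            continue  # second column is not numeric
    return entries


def summarize(entries: List[Tuple[str, float]]) -> Dict[str, object]:
    """Report basic facts about the second column."""
    values = [v for _, v in entries]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        # Whole numbers only would be consistent with raw counts,
        # but would not prove that is what they are.
        "all_integers": all(v.is_integer() for v in values),
    }
```

To try it, open the file (for example
open("eng.cube.word-freq", encoding="utf-8"); the encoding is also an
assumption) and pass the file object to parse_word_freq, then pass
the result to summarize.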

I apologize if this has been discussed before. I would really
appreciate it if somebody could answer the questions above or point
me to the relevant discussion.

Best,
Nitin
