lang.cube.word-freq file

nitin balajee Mon, 12 Sep 2011 18:43:43 -0700

Hello,
I was playing with tesseract and learned that it uses a dictionary to
improve recognition. I found no dictionary files when I extracted
eng.traineddata, but I did find 'eng.cube.word-freq', which contains an
exhaustive list of English words, and a couple of similar files for other
languages under the /tessdata/ folder. I tried searching the group for
related discussion, but I was unable to find any. (Maybe I overlooked it.)


I also noticed no difference in my results with and without the
cube.word-freq file. Though I have not trained tesseract for a new
language/font, it recognizes most of my input text properly with the given
eng.traineddata. It is the minor aberrations/incorrect recognitions that
I am working on.

I have a couple of questions on this.

1. What is the purpose of the [lang].cube.word-freq file, given that it is
not a part of eng.traineddata? Is it part of an earlier version of tesseract?
2. The eng.cube.word-freq file contains 2 columns. What does the second
column signify? Is it the frequency or a weight associated with the word?

I apologize if this has been discussed before. I would really appreciate
it if somebody could clarify the above questions or point me to the
relevant discussion.

Best,
Nitin
