Hello,

I was playing with Tesseract and noticed that it can use a dictionary to improve recognition. I found no dictionary files when I extracted eng.traineddata, but I did find 'eng.cube.word-freq', which contains an exhaustive list of English words, along with a couple of similar files for other languages, under the /tessdata/ folder. I tried searching the group for a related discussion but was unable to find one. (Maybe I overlooked it.)
I also noticed that my results were the same with and without the eng.cube.word-freq file. Although I have not trained Tesseract for a new language or font, it recognizes most of my input text correctly with the stock eng.traineddata. What I am working on are the minor aberrations and incorrect recognitions that remain.

I have a couple of questions on this:

1. What is the purpose of the [lang].cube.word-freq file, given that it is not part of eng.traineddata? Is it left over from an earlier version of Tesseract?
2. The eng.cube.word-freq file contains 2 columns. What does the second column signify? Is it the frequency of the word, or a weight associated with it?

I apologize if this has been discussed before. I would really appreciate it if somebody could answer these questions or point me to the relevant discussion.

Best,
Nitin

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

