Re: Box/Tiff for Chinese

2011-02-07 Thread daemon-s
a) You can use the -e option of the combine_tessdata tool to extract individual components from a .traineddata file, like this: combine_tessdata -e tessdata/eng.traineddata /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset For more details see this:
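The extraction command in this post can also be scripted. The following is only a rough sketch: the helper names build_extract_command and extract_components are hypothetical (they are not part of Tesseract), and it assumes combine_tessdata is on the PATH and that each requested component name is a valid suffix inside the .traineddata file.

```python
import shutil
import subprocess
from pathlib import Path


def build_extract_command(traineddata, out_dir, lang, components):
    """Build the argv for `combine_tessdata -e` (hypothetical helper).

    Each component (e.g. "config", "unicharset") becomes one output
    path of the form <out_dir>/<lang>.<component>.
    """
    out_dir = Path(out_dir)
    outputs = [str(out_dir / f"{lang}.{name}") for name in components]
    return ["combine_tessdata", "-e", str(traineddata), *outputs]


def extract_components(traineddata, out_dir, lang, components):
    """Run the extraction and return the argv that was used.

    Assumes the combine_tessdata binary is installed; raises
    FileNotFoundError otherwise.
    """
    cmd = build_extract_command(traineddata, out_dir, lang, components)
    if shutil.which("combine_tessdata") is None:
        raise FileNotFoundError("combine_tessdata not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)
    return cmd
```

Calling extract_components("tessdata/eng.traineddata", "/home/user/temp", "eng", ["config", "unicharset"]) would run the same command as in the post, with /home/user standing in for /home/$USER.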

Provide/visualize baseline info?

2011-02-04 Thread daemon-s
Hi! I train Tess using a separate image for every text line. Recognition is also run on single text line images. Recognition performs pretty well; however, there are many errors that, I believe, are related to misdetected baselines, whether during training or recognition I don't know. These include:

Tesseract and old fonts

2011-01-18 Thread daemon-s
*** On behalf of Andy Syme who could not post in this group probably due to spam removal artefacts *** ...my problem is that I have some documents written in 1890-1920 that I scanned and want to OCR. They are in English. Using the standard English language file I was getting 40-50% recognition. I

Re: Tesseract and old fonts

2011-01-18 Thread daemon-s
Dear Andrew, I have a couple of observations on your problem. - The standard English language file was created from a set of training images of well-known computer fonts such as Arial, Times, Verdana and some Ghostscript fonts, plus their italic and bold versions. Your book document's characters