See http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/combine_tessdata.1.html
for instructions on how to unpack the unicharambigs file and how to overwrite it in the traineddata after updating it.

Shree Devi Kumar
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

On Mon, Apr 22, 2013 at 2:34 PM, Shree Devi Kumar <shreesh...@gmail.com> wrote:

> Please look at the unicharambigs file for your language. You can add these
> substitutions to it and recombine the traineddata without needing to
> do any additional training.
>
> Please see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3,
> section "The last file (unicharambigs)":
>
>> The final data file that Tesseract uses is called unicharambigs. It
>> represents the intrinsic ambiguity between characters or sets of
>> characters, and is currently entirely manually generated. To understand the
>> file format, look at the following example:
>>
>> v1
>> 3	I I 0	2	u o	3
>> 3	I - I	1	H	2
>> 2	' '	1	"	1
>> 2	ಕೊ 6	1	ಕೋ	1
>> 1	m	2	r n	0
>> 3	i i i	1	m	0
>>
>> The first line is a version identifier. The remaining lines consist of 5
>> tab-separated fields. The first field is the number of strings in the
>> second field. The 3rd field is the number of strings in the 4th field, and
>> the 5th field is a type indicator. The 2nd and 4th fields consist of a
>> number of space-separated strings. As with the other files, this is a UTF-8
>> format file, and therefore each string is a UTF-8 string. Each of these
>> strings must match the first field of some line in the unicharset file, i.e.
>> it must be a recognizable unit.
>
> If that doesn't work, you can try post-processing the OCR output. VietOCR
> allows a user-defined substitution file for this purpose.
> See http://vietocr.sourceforge.net/usage.html, section on post-processing:
>
>> In addition to the built-in text postprocessing algorithm, you can add
>> your own custom text replacement scheme via a text file named
>> x.DangAmbigs.txt, where x is the ISO639-3 language code.
>> The UTF-8-encoded file should contain equal-sign-delimited
>> oldValue=newValue pairs.
>
> Shree Devi Kumar
>
> On Mon, Apr 22, 2013 at 2:00 PM, Attila Sukosd <att...@opensourceshift.com> wrote:
>
>> Hi all,
>>
>> I'm trying to run some OCR on some old-ish Danish datasets from 1970+,
>> and it seems like some of the characters are consistently recognized incorrectly:
>>
>> å => á
>> mm => nn
>> : => e
>> l => 1
>>
>> Is there any way to improve the recognition of these individual
>> characters without having to retrain the complete font?
>> I've found a lot of documents on how to train a completely new font, but
>> not a lot on how to improve existing ones.
>>
>> Best,
>>
>> Attila
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to tesseract-ocr@googlegroups.com
>> To unsubscribe from this group, send email to
>> tesseract-ocr+unsubscr...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.
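
The five-field unicharambigs layout quoted in this thread can be sanity-checked with a few lines of Python before the file is recombined into the traineddata. This is only a minimal sketch and not part of Tesseract: AmbigEntry, parse_ambig_line and format_ambig_line are hypothetical helper names. It assumes real tab characters between fields and single spaces between strings, and it treats the type indicator as an opaque integer, since the thread does not explain its values.

```python
from typing import List, NamedTuple


class AmbigEntry(NamedTuple):
    """One unicharambigs line (hypothetical helper type, not a Tesseract API)."""
    source: List[str]      # strings from field 2
    target: List[str]      # strings from field 4
    type_indicator: int    # field 5, kept as an opaque integer


def parse_ambig_line(line: str) -> AmbigEntry:
    """Parse one tab-separated line and verify the two string counts."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    n_source, source_text, n_target, target_text, kind = fields
    source = source_text.split(" ")
    target = target_text.split(" ")
    # Field 1 must equal the number of strings in field 2,
    # and field 3 the number of strings in field 4.
    if int(n_source) != len(source):
        raise ValueError(f"field 1 says {n_source}, field 2 has {len(source)} strings")
    if int(n_target) != len(target):
        raise ValueError(f"field 3 says {n_target}, field 4 has {len(target)} strings")
    return AmbigEntry(source, target, int(kind))


def format_ambig_line(source: List[str], target: List[str], type_indicator: int) -> str:
    """Build a line whose counts are derived from the lists, so they cannot drift."""
    return "\t".join([
        str(len(source)), " ".join(source),
        str(len(target)), " ".join(target),
        str(type_indicator),
    ])
```

For the question at the bottom of the thread, and assuming "mm => nn" means that "mm" is misread as "nn", a candidate line could be built with format_ambig_line(["n", "n"], ["m", "m"], 1). The type value 1 here is simply copied from the quoted example lines; check the training documentation for the value that fits. Each string must also appear in the language's unicharset, which this sketch does not check.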