Re: Training individual characters in an existing language

Shree Devi Kumar Mon, 22 Apr 2013 03:31:16 -0700

See
http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/combine_tessdata.1.html
for instructions on how to unpack the unicharambigs file and how to
overwrite it in the traineddata after updating it.
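
As a rough illustration of that unpack/edit/overwrite cycle, here is a minimal
Python sketch that only builds the two combine_tessdata command lines. The -u
(unpack) and -o (overwrite) options are the ones described on the man page
linked above; the file names dan.traineddata and work/dan. are assumptions
made up for this example, and the helper names are hypothetical.

```python
import subprocess
from typing import List


def unpack_command(traineddata: str, prefix: str) -> List[str]:
    """Command that unpacks every component to files named <prefix><component>."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def overwrite_command(traineddata: str, component_file: str) -> List[str]:
    """Command that writes one edited component back into the traineddata."""
    return ["combine_tessdata", "-o", traineddata, component_file]


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command by default; execute it only when dry_run is False."""
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever your tessdata lives.
    run(unpack_command("dan.traineddata", "work/dan."))
    # ... edit work/dan.unicharambigs by hand here ...
    run(overwrite_command("dan.traineddata", "work/dan.unicharambigs"))
```

Working on a copy of the traineddata file is the safer choice, since the
overwrite step modifies it in place.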

Shree Devi Kumar
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com


On Mon, Apr 22, 2013 at 2:34 PM, Shree Devi Kumar <shreesh...@gmail.com> wrote:

> Please look at the unicharambigs file for your language. You can add these
> substitutions to that file and recombine the traineddata without doing any
> additional training.
>
> Please see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3,
> section "The last file (unicharambigs)":
>
>> The final data file that Tesseract uses is called unicharambigs. It
>> represents the intrinsic ambiguity between characters or sets of
>> characters, and is currently entirely manually generated. To understand the
>> file format, look at the following example:
>>
>> v1
>> 3       I I 0   2       u o     3
>> 3       I - I   1       H       2
>> 2       ' '     1       "       1
>> 2       ಕೊ 6    1       ಕೋ     1
>> 1       m       2       r n     0
>> 3       i i i   1       m       0
>>
>> The first line is a version identifier. The remaining lines consist of 5
>> tab-separated fields. The first field is the number of strings in the
>> second field. The 3rd field is the number of strings in the 4th field, and
>> the 5th field is a type indicator. The 2nd and 4th fields consist of a
>> number of space-separated strings. As with the other files, this is a UTF-8
>> format file, and therefore each string is a UTF-8 string. Each of these
>> strings must match the first field of some line in the unicharset file, i.e.
>> it must be a recognizable unit.
>>
>
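
To make the five-field layout concrete, here is a minimal Python sketch that
parses and formats one unicharambigs line exactly as described in the quoted
text. It assumes the fields are separated by real tab characters (the example
as shown in this message appears with spaces) and treats the type indicator
as an opaque integer. The names Ambig, parse_ambig_line and format_ambig_line
are hypothetical helpers, not part of Tesseract.

```python
from typing import List, NamedTuple


class Ambig(NamedTuple):
    source: List[str]  # strings of the 2nd field
    target: List[str]  # strings of the 4th field
    kind: int          # 5th field, the type indicator


def parse_ambig_line(line: str) -> Ambig:
    """Parse one data line (not the 'v1' version line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    n_src, src, n_dst, dst, kind = fields
    source = src.split(" ")
    target = dst.split(" ")
    # Fields 1 and 3 must equal the number of strings in fields 2 and 4.
    if len(source) != int(n_src) or len(target) != int(n_dst):
        raise ValueError("string counts do not match fields 1 and 3")
    return Ambig(source, target, int(kind))


def format_ambig_line(source: List[str], target: List[str], kind: int) -> str:
    """Build a line, deriving the two count fields from the lists."""
    return "\t".join(
        [str(len(source)), " ".join(source), str(len(target)), " ".join(target), str(kind)]
    )


if __name__ == "__main__":
    print(parse_ambig_line("1\tm\t2\tr n\t0"))
    print(format_ambig_line(["i", "i", "i"], ["m"], 0))
```

Deriving the counts from the lists avoids the easy mistake of editing the
strings by hand and forgetting to update fields 1 and 3. Each string still has
to exist in the language's unicharset, which this sketch does not check.
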
> If that doesn't work, you can try post-processing the OCR output. VietOCR
> supports a user-defined substitution file for this purpose.
> See http://vietocr.sourceforge.net/usage.html, section on post-processing:
>
>> In addition to the built-in text postprocessing algorithm, you can add
>> your own custom text replacement scheme via a text file named
>> x.DangAmbigs.txt, where x is the ISO639-3 language code. The
>> UTF-8-encoded file should contain equal sign-delimited
>> oldValue=newValue pairs.
>>
>
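
If you are not using VietOCR, the same idea is easy to approximate yourself.
The Python sketch below reads oldValue=newValue pairs and applies them as
plain literal replacements, in file order. This is an assumption about how
such a file could be applied, not VietOCR's actual implementation, and the
function names are hypothetical.

```python
from typing import List, Tuple


def parse_substitutions(text: str) -> List[Tuple[str, str]]:
    """Read oldValue=newValue pairs, skipping blank or malformed lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        old, new = line.split("=", 1)
        if not old:
            continue  # an empty oldValue would match everywhere
        pairs.append((old, new))
    return pairs


def apply_substitutions(ocr_text: str, pairs: List[Tuple[str, str]]) -> str:
    """Apply each literal replacement in order."""
    for old, new in pairs:
        ocr_text = ocr_text.replace(old, new)
    return ocr_text


if __name__ == "__main__":
    rules = parse_substitutions("á=å\n")
    print(apply_substitutions("gár", rules))
```

Blind replacement is only safe for confusions that never occur legitimately
in the language, such as á in Danish text. A rule like nn=mm would also
rewrite words that really contain nn, so it needs a dictionary check or
manual review.
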
> Shree Devi Kumar
>
>
> On Mon, Apr 22, 2013 at 2:00 PM, Attila Sukosd <att...@opensourceshift.com> wrote:
>
>> Hi all,
>>
>> I'm trying to run OCR on some old-ish Danish datasets from 1970 onward,
>> and some of the characters are consistently misrecognized:
>>
>> å => á
>> mm => nn
>> : => e
>> l => 1
>>
>> Is there any way to improve on the recognition of these individual
>> characters without having to retrain the complete font?
>> I've found a lot of documents on how to train a completely new font, but
>> not a lot on how to improve on existing ones.
>>
>> Best,
>>
>> Attila
>>
>> --
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to tesseract-ocr@googlegroups.com
>> To unsubscribe from this group, send email to
>> tesseract-ocr+unsubscr...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.

