unicharset.5.html

Jing JC Tue, 15 Jul 2014 20:14:28 -0700

Here is the flow I finally used to generate my .traineddata.

convert albasha.png albasha.tiff



# Each box results in a component which represents a single character.

tesseract eng.matrx60x40.exp0.tiff eng.matrx60x40.exp0 batch.nochop makebox


# Tell Tesseract the correct results. We tell Tesseract about the mistakes it
# made so it won't make the same mistakes in the next recognition.

tesseract eng.matrx60x40.exp0.tiff eng.matrx60x40.exp0.box nobatch box.train


# extract the charset from the box file.

unicharset_extractor eng.matrx60x40.exp0.box


# Syntax: fontname italic bold fixed serif fraktur

echo "matrx60x40 1 0 0 0 0" > font_properties
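A mistake in font_properties is easy to make, so here is a minimal sketch of a check for one line of that file. It assumes the layout given in the comment above (a font name followed by five 0/1 flags); check_font_properties_line is a hypothetical helper, not a Tesseract tool.

```shell
# Hypothetical helper (not part of Tesseract): validate one font_properties line.
# Assumed layout: <fontname> <italic> <bold> <fixed> <serif> <fraktur>,
# where each of the five flags is 0 or 1.
check_font_properties_line() {
    printf '%s\n' "$1" | awk '
        NF != 6 { exit 1 }
        { for (i = 2; i <= 6; i++) if ($i != "0" && $i != "1") exit 1 }
    '
}
```

It returns success for "matrx60x40 1 0 0 0 0" and failure when a flag is missing or is not 0 or 1.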



# The character features are clustered.

mftraining -F font_properties -U unicharset -O eng.unicharset eng.matrx60x40.exp0.box.tr


cntraining eng.matrx60x40.exp0.box.tr


# rename all files created by mftraining and cntraining, add the prefix eng.

mv Microfeat eng.Microfeat;mv normproto eng.normproto;

mv pffmtable eng.pffmtable;mv inttemp eng.inttemp

# mv mfunicharset eng.mfunicharset # only has unicharset


combine_tessdata eng.
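Before running combine_tessdata it may help to confirm that the renamed files are actually there. The following is only a sketch: missing_training_files is a hypothetical helper, and the list of names is simply taken from the steps in this flow, not from an authoritative list of what combine_tessdata requires.

```shell
# Hypothetical helper: print the files from the flow above that are
# missing for a given language prefix (e.g. "eng").
# The file list is an assumption based on the files renamed in this flow.
missing_training_files() {
    prefix=$1
    for name in unicharset inttemp pffmtable normproto; do
        [ -f "$prefix.$name" ] || printf '%s\n' "$prefix.$name"
    done
}
```

Running `missing_training_files eng` in the training directory prints nothing when all four files exist.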


Is there a recommended unicharset_extractor tool for non-Windows systems, 
such as CentOS, Ubuntu, or Mac? 

It seems my flow has no shape clustering step? 
The dictionary is optional. 

I am not very familiar with shape clustering or the wordlist2dawg command, 
and I don't know how to use them to improve the quality of my 
.traineddata. 
Any hints on that too? 



On Tuesday, 15 July 2014 12:41:37 UTC-7, Nick White wrote:
>
> Hi, 
>
> On Tue, Jul 15, 2014 at 10:04:24AM -0700, Jing JC wrote: 
> > yep yep. 
> > 
> > Thanks a lot Nick. 
> > 
> > I tried to cancel my post last night, 
> > but it seems I cannot access it after posting and before approval. 
> > 
> > I tried to match the V2's example to V3's format. 
> > 
> > I figured it out later. 
>
> No problem, and don't worry about being unable to cancel your post; 
> it may yet help others in the future :) 
>
> > I have another question now: 
> > I am using 3.02. 
> > When I open my eng.unicharset and my own .unicharset, used for generating my 
> > own .traineddata, 
> > it shows me 
> > 
> > "q 3 Latin 88 
> > & 10 Common 83 
> > ’ 10 Common 84 
>
> You're right, that is the v2 format. How are you generating your 
> .unicharset file? The unicharset_extractor tool normally outputs the 
> v3 format. 
>
> Nick 
>
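To tell the two layouts apart quickly, something like the sketch below might be enough. It rests on an assumption drawn from the unicharset manual page: in the newer layout the third field of a character line is a comma-separated list of glyph metrics, while in the older layout (as in the "q 3 Latin 88" lines quoted above) the third field is the script name. unicharset_layout is a hypothetical helper.

```shell
# Hypothetical helper: guess the unicharset layout from its fields.
# Assumption: a comma in the third field of any line after the first
# (the first line is the entry count) means the glyph-metrics layout.
unicharset_layout() {
    awk 'NR > 1 && $3 ~ /,/ { found = 1 }
         END { print (found ? "with-metrics" : "without-metrics") }' "$1"
}
```

For example, `unicharset_layout eng.unicharset` prints one of the two labels for the file produced by mftraining in the flow above.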


Re: [tesseract-ocr] questions when reading unicharset manual: https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html
