a) You can use the -e option for the combine_tessdata tool to extract
individual components of the .traineddata file, like this:

combine_tessdata -e tessdata/eng.traineddata \
  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

For more details, see:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Putting_it_all_together
However, this method only gives you the "compiled" files such as
inttemp, not the box and tiff files that were used for training. For
box/tiff pairs you should ask the developers.
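
As a rough sketch of the extraction step, something like the script below
could be used. The variable names (LANG_CODE, TESSDATA_DIR, OUT_DIR) and the
output directory are my own assumptions for illustration, not anything the
tool requires. With -e, combine_tessdata decides which component to extract
from the suffix of each output file name, so the files have to be named
<lang>.<component>.

```shell
#!/bin/sh
# Sketch: pull selected components out of a .traineddata file.
# LANG_CODE, TESSDATA_DIR and OUT_DIR are illustrative names (assumptions).
LANG_CODE="eng"
TESSDATA_DIR="tessdata"
OUT_DIR="${TMPDIR:-/tmp}/tess_extract"
TRAINEDDATA="$TESSDATA_DIR/$LANG_CODE.traineddata"

# Make sure the output directory exists before extracting into it.
mkdir -p "$OUT_DIR"

if command -v combine_tessdata >/dev/null 2>&1 && [ -f "$TRAINEDDATA" ]; then
    # -e extracts only the components named by the output file suffixes.
    combine_tessdata -e "$TRAINEDDATA" \
        "$OUT_DIR/$LANG_CODE.config" \
        "$OUT_DIR/$LANG_CODE.unicharset"
    # Alternative: -u unpacks every component, using the given path prefix:
    #   combine_tessdata -u "$TRAINEDDATA" "$OUT_DIR/$LANG_CODE."
else
    # Tool or input file missing: only report what would have been run.
    echo "combine_tessdata or $TRAINEDDATA not found; nothing extracted"
fi
```

The extracted files (unicharset, inttemp and so on) can then be reused when
you recombine a new .traineddata, but as said above they are not a
substitute for the original box/tiff pairs.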

b) Don't know. Ask the developers.

c) Imho there is no general way to make Tess recognize the English
parts as words rather than as individual characters. Usually mixed
languages present no problem for European scripts. For your situation
I see no other way but to train on a mixture of both scripts and
generate a joint language file. Having the Chinese box/tiff pairs would
be a significant relief here (the English pairs are available from the
Downloads), but for some reason (probably copyright) Google holds them
back.

Regards,
Dmitry Silaev

On Feb 7, 10:53 am, devTess <jim...@googlemail.com> wrote:
> I would like to change the recognition for e.g 10 -20 characters that
> do not work with the current language data,
>
> questions
>
> a) Is there a way to un-concatenate the language data for re-use in
> training?
>
> b) When will there be a box/tiff file for chinese?
>
> c) For text that has a mixture of chinese and english, what would be a
> good choice of parameters to perform OCR so that the english
> characters are recognized as words and not individual characters.
