a) You can use the -e option of the combine_tessdata tool to extract individual components from a .traineddata file, like this:
combine_tessdata -e tessdata/eng.traineddata /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

For more details, see: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Putting_it_all_together

However, this method won't give you box and tiff files, only "compiled" files such as inttemp. For box/tiff pairs you should ask the developers.

b) Don't know. Ask the developers.

c) Imho there is no general way to make Tess recognize the English text as words rather than individual characters. Usually mixed languages present no problem for European scripts. For your situation I see no other way but to train on a mixture of scripts and generate a joint language file. Chinese box/tiff pairs would be a significant relief here (the English pairs are available from the Downloads), but for some reason (probably copyright) Google holds them back.

Regards,
Dmitry Silaev

On Feb 7, 10:53 am, devTess <jim...@googlemail.com> wrote:
> I would like to change the recognition for e.g 10 -20 characters that
> do not work with the current language data,
>
> questions
>
> a) Is there a way to un-concatenate the language data for re-use in
> training?
>
> b) When will there be a box/tiff file for chinese?
>
> c) For text that has a mixture of chinese and english, what would be a
> good choice of parameters to perform OCR so that the english
> characters are recognized as words and not individual characters.
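
P.S. regarding a): if you need to pull several components at once, the -e call can be wrapped in a small shell function. This is only a rough sketch, not something from the Tesseract docs: the function name extract_components and the DRY_RUN switch are my own invention, and it assumes each output file is named <lang>.<component> as in the example (config, unicharset, inttemp).

```shell
#!/bin/sh
# Sketch only: wraps "combine_tessdata -e" for several components.
# extract_components and DRY_RUN are hypothetical names, not part of Tesseract.
# Assumes paths contain no spaces (the command string is word-split on purpose).

extract_components() {
    traineddata=$1   # e.g. tessdata/eng.traineddata
    outdir=$2        # e.g. /home/$USER/temp
    shift 2          # remaining arguments are component names

    # Language prefix is the file name without the .traineddata suffix.
    lang=$(basename "$traineddata" .traineddata)

    # Append one output path per requested component: <outdir>/<lang>.<component>
    cmd="combine_tessdata -e $traineddata"
    for component in "$@"; do
        cmd="$cmd $outdir/$lang.$component"
    done

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"   # only show what would be run
    else
        $cmd          # run the extraction
    fi
}
```

For example, with DRY_RUN set to 1, calling extract_components tessdata/eng.traineddata /home/$USER/temp config unicharset just prints the same command line as in a) above, so you can check it before running it for real.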