Shree, thanks for your reply.
I have another problem in the project that I need your help with: some of the characters in my data are italicized, and their recognition accuracy is low. Can I add italic characters to the training data for our model? As far as I can tell, italic characters cannot be added to the chi_sim.training_text file (https://github.com/tesseract-ocr/langdata/blob/master/chi_sim/chi_sim.training_text) in https://github.com/tesseract-ocr/langdata/tree/master/chi_sim. How should I train these italic characters?

On Thursday, September 14, 2017 at 4:30:40 PM UTC+8, shree wrote:
>
> It is a known problem with the latest code in github - see
> https://github.com/tesseract-ocr/tesseract/issues/1114
>
> Waiting for fix from Ray.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Sep 14, 2017 at 1:50 PM, <[email protected]>
> wrote:
>
>> Hello,
>>
>> I'm trying to train my traineddata model with Tess4.0, following the
>> commands in the *TrainingTesseract 4.00* tutorial.
>> The first command to create training data is shown below:
>>
>> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
>>   --noextract_font_properties --langdata_dir ../langdata \
>>   --fontlist "SIMSUN" --tessdata_dir ./tessdata \
>>   --output_dir ~/tesstutorial/trainspecial
>>
>> And the execution log for this command is as follows:
>>
>> === Phase I: Generating training images ===
>> Rendering using SIMSUN
>> [2017年 09月 14日 星期四 16:01:57 CST] /usr/local/bin/text2image
>> --fontconfig_tmpdir=/tmp/font_tmp.whlzhytMkp --fonts_dir=/usr/share/fonts
>> --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0
>> --outputbase=/tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0 --max_pages=3
>> --font=SIMSUN --text=../langdata/chi_sim/chi_sim.training_text
>> Rendered page 0 to file
>> /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.tif
>>
>> === Phase UP: Generating unicharset and unichar properties files ===
>> [2017年 09月 14日 星期四 16:01:58 CST] /usr/local/bin/unicharset_extractor
>> --output_unicharset /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset
>> --norm_mode 1 /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.box
>> Extracting unicharset from box file
>> /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.box
>> Invalid Unicode codepoint: 0xffffffe8
>> IsValidCodepoint(ch):Error:Assert failed:in file normstrngs.cpp, line 225
>> ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or
>> is not readable
>>
>> An error appears during this process: chi_sim.unicharset could not be
>> extracted. I have checked the directory /tmp/tmp.8JcoYdZI17/chi_sim/,
>> and the chi_sim.unicharset file does not exist.
>>
>> How can I fix this error? Can you help me? Thanks.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/9b9b26b8-5fc8-42aa-bd7c-2305dffc6fd1%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>

