This is a known problem with the latest code on GitHub; see https://github.com/tesseract-ocr/tesseract/issues/1114
We are waiting for a fix from Ray.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Sep 14, 2017 at 1:50 PM, <[email protected]> wrote:

> Hello,
>
> I'm trying to train my own traineddata model with Tesseract 4.0, following
> the commands in the "TrainingTesseract 4.00" tutorial. The first command,
> which creates the training data, is:
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --fontlist "SIMSUN" --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainspecial
>
> The execution log for this command is:
>
> === Phase I: Generating training images ===
> Rendering using SIMSUN
> [2017年 09月 14日 星期四 16:01:57 CST] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.whlzhytMkp --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0 --max_pages=3 --font=SIMSUN --text=../langdata/chi_sim/chi_sim.training_text
> Rendered page 0 to file /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.tif
>
> === Phase UP: Generating unicharset and unichar properties files ===
> [2017年 09月 14日 星期四 16:01:58 CST] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset --norm_mode 1 /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.box
> Extracting unicharset from box file /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.box
> Invalid Unicode codepoint: 0xffffffe8
> IsValidCodepoint(ch):Error:Assert failed:in file normstrngs.cpp, line 225
> ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable
>
> The run fails at this step: chi_sim.unicharset is not generated. I have
> checked the directory /tmp/tmp.8JcoYdZI17/chi_sim/, and the
> chi_sim.unicharset file does not exist.
>
> How can I fix this error? Can you help me? Thanks.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/9b9b26b8-5fc8-42aa-bd7c-2305dffc6fd1%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduX4691q2scCHMmBCNohXRybx3oNXdoK2fKTRcJ39Jqa7A%40mail.gmail.com.
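
A note on the value in the log, offered as a guess rather than a diagnosis: 0xffffffe8 is exactly what the byte 0xE8 looks like if it is read as a signed 8-bit value and then widened to 32 bits. 0xE8 is the first UTF-8 byte of many common Chinese characters (those in U+8000 to U+8FFF). The Python sketch below only reproduces that arithmetic. The helper name sign_extend_byte and the sample character are my own choices for illustration, not anything taken from the Tesseract source, and whether this is really what happens inside unicharset_extractor is an assumption.

```python
def sign_extend_byte(byte: int) -> int:
    """Treat an 8-bit value as signed, then view the result as unsigned 32-bit.

    Hypothetical helper for illustration only; it mimics what happens when a
    signed char is widened to a 32-bit integer in C or C++.
    """
    signed = byte - 0x100 if byte >= 0x80 else byte  # 0xE8 -> -24
    return signed & 0xFFFFFFFF                       # -24 -> 0xffffffe8


# Sample character chosen for illustration: U+8FD9 encodes as E8 BF 99.
lead_byte = "这".encode("utf-8")[0]

print(hex(lead_byte))                    # 0xe8
print(hex(sign_extend_byte(lead_byte)))  # 0xffffffe8
```

If that reading is right, the box file itself is probably fine and the failure is in how the bytes are decoded, which would fit the issue linked above being a bug in the code rather than in the training data.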

