Use https://github.com/tesseract-ocr/tessdata_best if you are planning to retrain
Use https://github.com/tesseract-ocr/tessdata_fast if you want to OCR

See the wiki page for more details:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Sep 26, 2017 at 1:01 AM, wei ren <[email protected]> wrote:

> Thank you for the suggestion. I will give tesseract 4.0 a try. I hear that
> tesseract 4.0 uses an LSTM neural network, so its accuracy should be much
> better, especially for Chinese, but it may be much slower. Is that true?
>
> By the way, I have also tried tweaking the parameters of tesseract 3.05,
> and have significantly improved the results with the following parameters:
>
> assume_fixed_pitch_char_segment 1
> textord_use_cjk_fp_model 1
> textord_old_xheight 1
> textord_min_xheight 60
> textord_noise_hfract 0.1
>
> On Thursday, September 21, 2017 at 4:01:26 AM UTC-7, shree wrote:
>>
>> You will have much better results if you use the new version of tesseract
>> from https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>> and the traineddata files from https://github.com/tesseract-ocr/tessdata_best
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Sep 21, 2017 at 2:44 PM, wei ren <[email protected]> wrote:
>>
>>> I am new to OCR and tesseract. Please forgive me if I ask some "stupid"
>>> questions.
>>>
>>> I tried using tesseract 3.04.01 to recognize the Chinese characters in the
>>> two attached images and got absurd results, so I merged the two images into
>>> one and used the merged image yueyue.title.exp0.tif to train a new model.
>>> Below are the steps:
>>>
>>> 1. Create the box file.
>>>
>>> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l chi_sim batch.nochop makebox
>>>
>>> 2. Correct the errors in the box file in jTessBoxEditor.
>>>
>>> I fixed the segmentation errors and assigned the correct Chinese characters
>>> to the segmentations.
>>>
>>> 3. Train the new model.
>>>
>>> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 nobatch box.train
>>> $ unicharset_extractor yueyue.title.exp0.box
>>>
>>> 4. Define a font_properties file with the following content.
>>>
>>> title 0 0 0 0 0
>>>
>>> 5. Clustering.
>>>
>>> $ shapeclustering -F font_properties -U unicharset yueyue.title.exp0.tr
>>> $ mftraining -F font_properties -U unicharset -O unicharset yueyue.title.exp0.tr
>>> $ cntraining yueyue.title.exp0.tr
>>>
>>> 6. Prefix all the files with "title.".
>>>
>>> $ mv unicharset title.unicharset
>>> $ mv inttemp title.inttemp
>>> $ mv pffmtable title.pffmtable
>>> $ mv shapetable title.shapetable
>>> $ mv normproto title.normproto
>>>
>>> 7. Put all the files together.
>>>
>>> $ combine_tessdata title.
>>>
>>> 8. Copy the new model to the tesseract-ocr tessdata directory.
>>>
>>> $ sudo cp title.traineddata /usr/share/tesseract-ocr/tessdata/
>>>
>>> Then I typed the following command to recognize the Chinese characters in
>>> the merged training image again.
>>>
>>> $ tesseract yueyue.title.exp0.tif stdout -l title
>>>
>>> The expected result for both pages is "老妇人和母鸡", but the actual result of
>>> the first page is "老 老老老妇 人老妇母老鸡老" and the actual result of the second
>>> page is "老老妇人和母老鸡". I generated a box file using the new model, which is
>>> also attached:
>>>
>>> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l title batch.nochop makebox
>>>
>>> I find that although tesseract only assigns characters from the new model
>>> to the segmentations, it can't get the segmentations right. As you can see,
>>> three characters are each split into two segmentations, even though I had
>>> merged those two segmentations into one when I corrected the training box
>>> file.
>>>
>>> <https://lh3.googleusercontent.com/-r8UG3Svsbpo/WcN_98MjS7I/AAAAAAAAU8M/4ZMvHYfgOQ8OVp_fHIw__uZmTA6rFhyEgCLcBGAs/s1600/box2.png>
>>>
>>> <https://lh3.googleusercontent.com/-Wga1p7T579U/WcN_18CI2VI/AAAAAAAAU8I/Yvm9IB5zOGIXsbdcugPYTgiMRxbC02TTQCLcBGAs/s1600/box1.png>
>>>
>>> I have tried specifying the font as bold and/or fixed in font_properties
>>> and it doesn't help. I have also tried various page segmentation methods
>>> and that doesn't help either.
>>>
>>> I also attach the trained tessdata here so you can easily reproduce the
>>> problems. Any hint or suggestion will be highly appreciated.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduV_KvC-jJMZEv7NTsi5h8exCCFc5xA%2BUHAPHz863CWg8Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
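
A minimal sketch, not taken from the thread: the tesseract 3.05 parameters listed in the quoted message could be kept in a config file and passed as the last argument on the command line. The file name cjk_title.config is made up for this example, and using chi_sim as the language is an assumption; the image name is the one from the quoted commands.

```shell
#!/bin/sh
# Sketch only: store the parameters from the quoted message in a config file.
# "cjk_title.config" is a hypothetical file name chosen for this example.
set -u

config="cjk_title.config"
image="yueyue.title.exp0.tif"

# One "name value" pair per line, copied from the parameter list in the thread.
cat > "$config" <<'EOF'
assume_fixed_pitch_char_segment 1
textord_use_cjk_fp_model 1
textord_old_xheight 1
textord_min_xheight 60
textord_noise_hfract 0.1
EOF

# Run OCR only if both the tool and the image are present;
# otherwise just print the command that would be run.
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
    tesseract "$image" stdout -l chi_sim "$config" || echo "tesseract failed" >&2
else
    echo "would run: tesseract $image stdout -l chi_sim $config"
fi
```

Keeping the values in a file makes it easy to rerun the same settings on other images. Whether these settings help on a different scan is something to check case by case.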

