Re: [tesseract-ocr] Incorrect segmentation of Chinese characters even after training a new model

ShreeDevi Kumar Mon, 25 Sep 2017 20:20:06 -0700

Use https://github.com/tesseract-ocr/tessdata_best if you are planning to
retrain.

Use https://github.com/tesseract-ocr/tessdata_fast if you just want to run OCR.

See the wiki page for more details:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017
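
As a rough sketch of how the two repositories might be used from a shell, the helper below builds the download URL for one language file. The function names, the local tessdata directory, and the "main" branch in the URL are assumptions for illustration and are not confirmed anywhere in this thread.

```shell
# Build the raw-download URL for one traineddata file.
# $1 = repository (tessdata_best or tessdata_fast), $2 = language code.
# Assumption: the repository's default branch is named "main".
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' "$1" "$2"
}

# Download a model into ./tessdata.
# Hypothetical helper: needs curl and network access, so it is only defined here.
fetch_model() {
    mkdir -p tessdata &&
    curl -fsSL -o "tessdata/$2.traineddata" "$(traineddata_url "$1" "$2")"
}

# Show the URL that would be fetched for simplified Chinese from tessdata_fast.
traineddata_url tessdata_fast chi_sim
```

After `fetch_model tessdata_fast chi_sim`, a command along the lines of `tesseract page.tif stdout --tessdata-dir tessdata -l chi_sim` should pick up the downloaded file (page.tif is a placeholder name). For retraining, the same call with tessdata_best as the first argument would apply.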

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Sep 26, 2017 at 1:01 AM, wei ren <[email protected]> wrote:

> Thank you for the suggestion. I will give tesseract 4.0 a try. I hear that
> tesseract 4.0 uses an LSTM neural network, so its accuracy should be much
> better, especially for Chinese, but it may also be much slower. Is that true?
>
> By the way, I have also tried tweaking the parameters of tesseract 3.05,
> and have significantly improved the results with the following parameters:
>
> assume_fixed_pitch_char_segment  1
> textord_use_cjk_fp_model         1
> textord_old_xheight              1
> textord_min_xheight             60
> textord_noise_hfract           0.1
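
One way to apply parameters like these, sketched here under assumptions, is to put them in a config file and name that file as the last argument to tesseract. The file name cjk_fixed_pitch is hypothetical, and the sketch assumes a config file given as a relative path is accepted in the same way as one installed under tessdata/configs.

```shell
# Write the five parameters from the message into a config file.
# "cjk_fixed_pitch" is a hypothetical file name chosen for this sketch.
cat > cjk_fixed_pitch <<'EOF'
assume_fixed_pitch_char_segment  1
textord_use_cjk_fp_model         1
textord_old_xheight              1
textord_min_xheight             60
textord_noise_hfract           0.1
EOF

# Run OCR with the config file as the trailing argument.
# Skipped when tesseract or the image from the thread is not present.
if command -v tesseract >/dev/null 2>&1 && [ -f yueyue.title.exp0.tif ]; then
    tesseract yueyue.title.exp0.tif stdout -l chi_sim ./cjk_fixed_pitch
fi
```

A config file is used rather than command-line overrides because some parameters may only be honoured when they are read at initialization time.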
>
>
>
> On Thursday, September 21, 2017 at 4:01:26 AM UTC-7, shree wrote:
>>
>> You will have much better results if you use the new version of tesseract
>> from https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>> and the traineddata files from
>> https://github.com/tesseract-ocr/tessdata_best
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Sep 21, 2017 at 2:44 PM, wei ren <[email protected]> wrote:
>>
>>> I am new to OCR and tesseract. Please forgive me if I ask some "stupid"
>>> questions.
>>>
>>> I tried using tesseract 3.04.01 to recognize the Chinese characters in the
>>> attached two images and got absurd results, so I merged the two images into
>>> one and used the merged image yueyue.title.exp0.tif to train a new model.
>>> The steps are below:
>>>
>>> 1. Create the box file.
>>>
>>> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l chi_sim batch.nochop makebox
>>>
>>> 2. Correct the errors in the box file in jTessBoxEditor.
>>>
>>> I fix the segmentation errors and assign the correct Chinese characters
>>> to the segmentations.
>>>
>>> 3. Train the new model.
>>>
>>> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 nobatch box.train
>>> $ unicharset_extractor yueyue.title.exp0.box
>>>
>>> 4. Define a font_properties file with the following content.
>>>
>>> title 0 0 0 0 0
>>>
>>> 5. Clustering.
>>>
>>> $ shapeclustering -F font_properties -U unicharset yueyue.title.exp0.tr
>>> $ mftraining -F font_properties -U unicharset -O unicharset yueyue.title.exp0.tr
>>> $ cntraining yueyue.title.exp0.tr
>>>
>>> 6. Prefix all the files with "title.".
>>>
>>> $ mv unicharset title.unicharset
>>> $ mv inttemp title.inttemp
>>> $ mv pffmtable title.pffmtable
>>> $ mv shapetable title.shapetable
>>> $ mv normproto title.normproto
>>>
>>> 7. Put all the files together.
>>>
>>> $ combine_tessdata title.
>>>
>>> 8. Copy the new model to the tesseract-ocr tessdata directory.
>>>
>>> $ sudo cp title.traineddata /usr/share/tesseract-ocr/tessdata/
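
Steps 3 to 7 above can be collected into one function, sketched below. It assumes the box file has already been corrected (step 2) and that font_properties exists (step 4). The names train_model, run, and DRY_RUN are hypothetical helpers added for illustration; the tool invocations are copied from the steps as quoted.

```shell
# run CMD...: echo the command, then execute it unless DRY_RUN is set.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# train_model LANG BASE
#   LANG = model name, e.g. title
#   BASE = image/box base name, e.g. yueyue.title.exp0
train_model() {
    lang=$1
    base=$2
    # Step 3: generate the .tr feature file and extract the character set.
    run tesseract "$base.tif" "$base" nobatch box.train || return 1
    run unicharset_extractor "$base.box" || return 1
    # Step 5: clustering.
    run shapeclustering -F font_properties -U unicharset "$base.tr" || return 1
    run mftraining -F font_properties -U unicharset -O unicharset "$base.tr" || return 1
    run cntraining "$base.tr" || return 1
    # Step 6: prefix the output files with the model name.
    for f in unicharset inttemp pffmtable shapetable normproto; do
        run mv "$f" "$lang.$f" || return 1
    done
    # Step 7: bundle everything into LANG.traineddata.
    run combine_tessdata "$lang."
}
```

Setting DRY_RUN=1 before calling `train_model title yueyue.title.exp0` prints the eleven commands without running them, which is a cheap way to check the file names before a real run.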
>>>
>>> Then I ran the following command to recognize the Chinese characters in
>>> the merged training image again.
>>>
>>> $ tesseract yueyue.title.exp0.tif stdout -l title
>>>
>>> The expected result for both pages is "老妇人和母鸡", but the actual result
>>> for the first page is "老 老老老妇 人老妇母老鸡老" and the actual result for the
>>> second page is "老老妇人和母老鸡". I generated a box file using the new model,
>>> which is also attached:
>>>
>>> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l title batch.nochop makebox
>>>
>>> Although tesseract only assigns characters from the new model to the
>>> segments, it does not find the correct segmentation. As you can see, three
>>> characters are each split into two segments, even though I had merged
>>> those segments into one when I corrected the training box file.
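
A quick way to see over-segmentation without opening an editor is to count the boxes on each page of the generated box file. The sketch below assumes the usual Tesseract 3 box format of one glyph per line as "glyph left bottom right top page"; count_boxes is a hypothetical helper name.

```shell
# count_boxes FILE: print the number of boxes found on each page of a box file.
# Assumes six whitespace-separated fields per line, with the page number last.
count_boxes() {
    awk 'NF == 6 { n[$NF]++ }
         END { for (p in n) printf "page %s: %d boxes\n", p, n[p] }' "$1" | sort
}
```

For the title in this thread, which has six characters, `count_boxes yueyue.title.exp0.box` would be expected to report six boxes per page; a larger count suggests characters were split.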
>>>
>>>
>>>
>>> <https://lh3.googleusercontent.com/-r8UG3Svsbpo/WcN_98MjS7I/AAAAAAAAU8M/4ZMvHYfgOQ8OVp_fHIw__uZmTA6rFhyEgCLcBGAs/s1600/box2.png>
>>>
>>> <https://lh3.googleusercontent.com/-Wga1p7T579U/WcN_18CI2VI/AAAAAAAAU8I/Yvm9IB5zOGIXsbdcugPYTgiMRxbC02TTQCLcBGAs/s1600/box1.png>
>>>
>>> I have tried specifying the font as bold and/or fixed in font_properties,
>>> and it doesn't help. I have also tried various page segmentation modes,
>>> and that doesn't help either.
>>>
>>>
>>> I also attach the trained tessdata here so you can easily reproduce the
>>> problems. Any hint or suggestion will be highly appreciated.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/18590868-ba1e-457d-8953-b002987d497d%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>