Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

Jingjing Lin Wed, 19 Jun 2019 10:36:22 -0700

Thanks for your comments. 

So do you mean that the method we used to add a special character to eng 
cannot be used to add one to chi_sim, and that we have to retrain the top 
layer instead?


Another question: when we use a smaller .training_text, the resulting 
.unicharset contains only a limited set of characters. For Chinese, this 
unicharset is much smaller than the one in langdata_lstm (GitHub). How do 
we combine the original .traineddata with the .traineddata we generated via 
fine tuning? I tried the command below, but it does not seem to do what I 
wanted:

lstmtraining --stop_training \
  --continue_from ~/tesstutorial/eng_from_chi/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --model_output ~/tesstutorial/eng_from_chi/eng.traineddata
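
To see exactly which characters a restricted unicharset has lost, the two files can be compared directly. The sketch below is only an illustration: `missing_unichars` is a hypothetical helper (not a Tesseract tool), the file names in the usage comment are placeholders, and it assumes the usual unicharset text layout, where the first line holds the entry count and every later line begins with the character followed by a space.

```shell
# Hypothetical helper, not part of Tesseract: print every unichar that is
# listed in the first unicharset but missing from the second.
# Assumes line 1 is the entry count and each later line starts with the
# unichar followed by a space.
missing_unichars() {
  awk '
    FNR == 1  { next }                 # skip the entry-count header
    NR == FNR { seen[$1] = 1; next }   # first file read: the restricted set
    !($1 in seen) { print $1 }         # second file read: the full set
  ' "$2" "$1"
}

# Usage sketch with placeholder file names (for example, files unpacked
# with combine_tessdata -u):
#   missing_unichars original.lstm-unicharset finetuned.lstm-unicharset
```

An empty result would suggest the fine-tuned unicharset already covers everything in the original one; any characters printed are the ones the smaller training text dropped.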



On Wednesday, June 19, 2019 at 11:44:22 AM UTC-4, shree wrote:
>
> Update:
>
> 1. When using a smaller training_text for chi_sim for plus training, the 
> unicharset gets restricted. So, merge the lstm-unicharset with it.
>
> 2. The unicharset for chi_sim using langdata is different from the one 
> extracted from tessdata_best, so using training_text from langdata will add 
> more characters.
>
> 3. The fonts used for LSTM training are listed in langdata_lstm in 
> okfonts.txt. For plus training the same fonts should be used; otherwise it 
> will require training of new typefaces.
>
> 4. Another user was trying to fine-tune chi_sim (check old forum posts) to 
> add the theta sign. If I remember correctly, the plus type of training did 
> not work for it. Replacing the top layer was the better option.
>
> 5. I am training with the following fonts. 
> "Adobe Heiti Std" \
> "Adobe Kaiti Std" \
> "Arial Unicode MS" \
> "Bitstream CyberCJK" \
> "Microsoft YaHei UI" \
> "Microsoft YaHei" \
> "NSimSun" \
> "Noto Sans CJK SC" \
> "Noto Sans Mono CJK SC" \
> "STXihei" \
> "SimSun" \
> "WenQuanYi Zen Hei Medium" \
> "WenQuanYi Zen Hei Mono Medium" \
> "WenQuanYi Zen Hei Sharp Medium" \
>
> At iteration 1046/1100/1100, Mean rms=0.704%, delta=1.445%, char train=4.888%, word train=46.842%, skip ratio=0%,  New best char error = 4.888 wrote best model:/home/ubuntu/tesstutorial/chi_sim_plus/chi_sim_plus4.888_1046.checkpoint wrote checkpoint.
>
>
> On Wed, Jun 19, 2019 at 12:36 AM Jingjing Lin <joejo...@gmail.com> wrote:
>
>> Can you please test on arrows (↑ or ↓) instead of ± if it's not 
>> inconvenient for you?
>>
>> On Tuesday, June 18, 2019 at 2:21:18 PM UTC-4, shree wrote:
>>>
>>> I will test tomorrow and let you know
>>>
>>> On Tue, 18 Jun 2019, 23:47 Jingjing Lin, <joejo...@gmail.com> wrote:
>>>
>>>> It still didn't work after I increased the number of ± to about 100, 
>>>> and the error rate after 2000 iterations is about 11. That is a pretty 
>>>> high error rate compared to what we get when adding a few characters to 
>>>> eng. With such a high error rate, I would not be surprised that it 
>>>> couldn't recognize some special characters like ±. Is this as good as it 
>>>> gets for chi_sim, or can I increase the iterations to make the error 
>>>> rate smaller? 
>>>> Thanks for your help.
>>>>
>>>> On Tuesday, June 18, 2019 at 10:32:37 AM UTC-4, shree wrote:
>>>>>
>>>>> Increase the number of ± to about 100.
>>>>>
>>>>> On Tue, Jun 18, 2019 at 7:39 PM Jingjing Lin <joejo...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> Sorry to bother you again and again.
>>>>>> I reduced the training text to about 450 lines, with about 30 ± in it. 
>>>>>> I used two fonts and 1000 iterations. But it looks like ± is still not 
>>>>>> picked up in the BEST OCR TEXT at all; it is always recognized as 
>>>>>> something else. What is happening here? Should I increase the number of 
>>>>>> ±, or do I need to increase the number of fonts? I'm trying more 
>>>>>> iterations now.
>>>>>>
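
Since the discussion turns on how many times ± appears in the training text, it can help to count the occurrences rather than estimate them. A minimal sketch, assuming a plain UTF-8 training text; `count_char` is a hypothetical helper name and the file name in the usage comment is a placeholder:

```shell
# Hypothetical helper: count literal occurrences of a string in a text file.
count_char() {
  CH="$2" awk '
    BEGIN { c = ENVIRON["CH"] }
    {
      s = $0
      # literal substring search, so no regex escaping is needed
      while (c != "" && (i = index(s, c)) > 0) {
        n++
        s = substr(s, i + length(c))
      }
    }
    END { print n + 0 }
  ' "$1"
}

# Usage sketch (placeholder file name):
#   count_char chi_sim.training_text '±'
```

The count is per occurrence, not per line, so a line containing the character twice contributes two.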
>>>>>> On Tuesday, June 18, 2019 at 12:28:25 AM UTC-4, shree wrote:
>>>>>>>
>>>>>>> If you increase the iterations then the plus type of training will 
>>>>>>> not give a good result, i.e. the other letters will lose accuracy.
>>>>>>>
>>>>>>> You can try to reduce the training text size while still keeping all 
>>>>>>> the characters that you need as part of the training text.
>>>>>>>
>>>>>>> On Tue, Jun 18, 2019 at 2:24 AM Jingjing Lin <joejo...@gmail.com> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I was only using two different fonts, and the lowest error rate it 
>>>>>>>> achieved after training was 11.271. Does this mean I really need to 
>>>>>>>> increase the iterations?
>>>>>>>>
>>>>>>>> On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>>>>>>>>>
>>>>>>>>> How big was your training text? How many iterations? Did the fonts 
>>>>>>>>> you use for training support the plus minus sign? 
>>>>>>>>>
>>>>>>>>> You can run training with --debug_interval -1 so that you can see 
>>>>>>>>> in the console messages whether the plus-minus sign is being picked 
>>>>>>>>> up for training.
>>>>>>>>>
>>>>>>>>> On Mon, 17 Jun 2019, 23:29 Jingjing Lin, <joejo...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks. It works. The new character I added was there.
>>>>>>>>>>
>>>>>>>>>> Do you have any idea why, after fine tuning, tesseract still 
>>>>>>>>>> couldn't recognize the new character I added? When I added '±' to 
>>>>>>>>>> eng it worked, but when I added '±' to chi_sim it didn't (explained 
>>>>>>>>>> below). Is there anything we need to pay attention to when fine 
>>>>>>>>>> tuning languages other than eng?
>>>>>>>>>>
>>>>>>>>>> I used 
>>>>>>>>>>
>>>>>>>>>> lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
>>>>>>>>>>   --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
>>>>>>>>>>   --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt 2>&1 |
>>>>>>>>>>   grep ±
>>>>>>>>>>
>>>>>>>>>> to check and ± only shows up in Truth but not in OCR
>>>>>>>>>>
>>>>>>>>>>
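
The `grep ±` check above can be made a little more explicit by counting the Truth and OCR lines separately. This is a rough sketch only: `truth_vs_ocr` is a hypothetical helper, and it assumes the relevant lstmeval output lines begin with the words "Truth" and "OCR", as the message describes.

```shell
# Hypothetical helper: read lstmeval output on stdin and count how many
# Truth lines and how many OCR lines contain the given string.
# Assumes those lines begin with the words Truth and OCR.
truth_vs_ocr() {
  CH="$1" awk '
    BEGIN { c = ENVIRON["CH"] }
    c != "" && index($0, c) > 0 {
      if ($0 ~ /^Truth/) t++
      else if ($0 ~ /^OCR/) o++
    }
    END { printf "truth=%d ocr=%d\n", t, o }
  '
}

# Usage sketch: pipe the lstmeval command shown above into it, e.g.
#   lstmeval ... 2>&1 | truth_vs_ocr '±'
```

A nonzero truth count next to a zero ocr count matches the situation reported here: the character is in the ground truth but never produced by the model.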
>>>>>>>>>> On Monday, June 17, 2019 at 11:31:24 AM UTC-4, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>> combine_tessdata -u new.traineddata new.
>>>>>>>>>>>
>>>>>>>>>>> will unpack the traineddata file. Check new.lstm-unicharset in it.
>>>>>>>>>>>
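
Once the file is unpacked as described above, the lookup itself can be scripted. A minimal sketch, assuming the unicharset text layout in which the first line is the entry count and each later line starts with the character followed by a space; `has_unichar` is a hypothetical helper, not a Tesseract tool:

```shell
# Hypothetical helper: succeed (exit status 0) if the given unichar has its
# own entry in a unicharset file, fail (exit status 1) otherwise.
has_unichar() {
  CH="$2" awk '
    BEGIN { c = ENVIRON["CH"] }
    FNR > 1 && ($1 "") == (c "") { found = 1 }   # compare as strings
    END { exit !found }
  ' "$1"
}

# Usage sketch, after running: combine_tessdata -u new.traineddata new.
#   has_unichar new.lstm-unicharset '±' && echo "present"
```

Matching on the first field only avoids false positives from other columns of the unicharset that may happen to contain the same character.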
>>>>>>>>>>> On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I tried to fine tune the model and add a new character via 
>>>>>>>>>>>> training, but the newly generated traineddata still doesn't seem 
>>>>>>>>>>>> to recognize this character. To debug, I want to check whether 
>>>>>>>>>>>> the new character is in the .unicharset inside the new 
>>>>>>>>>>>> traineddata. Is there a way to do this?
>>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr
>>>>>>>>>> .
>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/d251e677-5f9d-4f8f-b41a-aa015538ca47%40googlegroups.com
>>>>>>>>>>  
>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/d251e677-5f9d-4f8f-b41a-aa015538ca47%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>> .
>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>>
>>>>>>> ____________________________________________________________
>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>

