Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-19 Thread Jingjing Lin
Thanks for your comments. 

So do you mean that the method used to add a special character to eng 
cannot be used to add one to chi_sim, and that we have to retrain the top 
layer to achieve this?

Another question: when we use a smaller .training_text, the .unicharset 
only contains a limited number of characters. For Chinese, this unicharset is 
much smaller than the one in langdata_lstm (GitHub). How do we 
combine the original .traineddata with the .traineddata we generated via 
fine tuning? I tried the command below, but it does not seem to do 
that:

lstmtraining --stop_training \
  --continue_from ~/tesstutorial/eng_from_chi/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --model_output ~/tesstutorial/eng_from_chi/eng.traineddata




Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-19 Thread Shree Devi Kumar
Old thread
https://groups.google.com/forum/#!searchin/tesseract-ocr/layer$20chi_sim%7Csort:date/tesseract-ocr/iFMg7Gjczq4/f7_XRop2BAAJ



Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-19 Thread Shree Devi Kumar
Update:

1. When using a smaller training_text for chi_sim for plus training, the
unicharset gets restricted. So, merge the lstm-unicharset with it.

2. The unicharset for chi_sim using langdata is different from the one
extracted from tessdata_best. So using the training_text from langdata will
add more characters.

3. The fonts used for LSTM training are given in langdata_lstm in
okfonts.txt. For plus training the same fonts should be used; otherwise it
will require training of new typefaces.

4. Another user was trying to fine-tune chi_sim (check old forum posts) to
add the theta sign. If I remember correctly, the plus type of training did
not work for it. Replacing the top layer was the better option.

5. I am training with the following fonts.
"Adobe Heiti Std" \
"Adobe Kaiti Std" \
"Arial Unicode MS" \
"Bitstream CyberCJK" \
"Microsoft YaHei UI" \
"Microsoft YaHei" \
"NSimSun" \
"Noto Sans CJK SC" \
"Noto Sans Mono CJK SC" \
"STXihei" \
"SimSun" \
"WenQuanYi Zen Hei Medium" \
"WenQuanYi Zen Hei Mono Medium" \
"WenQuanYi Zen Hei Sharp Medium" \

At iteration 1046/1100/1100, Mean rms=0.704%, delta=1.445%, char
train=4.888%, word train=46.842%, skip ratio=0%,  New best char error =
4.888 wrote best
model:/home/ubuntu/tesstutorial/chi_sim_plus/chi_sim_plus4.888_1046.checkpoint
wrote checkpoint.
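
Items 1 and 2 above are about which characters end up in the unicharset. As a minimal sketch (not from the thread), the hypothetical helper below lists characters that are present in one unpacked unicharset but missing from another. It assumes the plain-text unicharset layout in which the first line is an entry count and each later line starts with the character. The file names in the comments are placeholders.

```shell
# Hypothetical helper: print characters found in unicharset $1 but not in $2.
# Assumes line 1 of each file is the entry count and that the first
# space-separated field of every later line is the character itself.
missing_chars() {
  awk '
    FNR == 1 { next }                            # skip the entry-count header
    FILENAME == ARGV[1] { base[$1] = 1; next }   # remember entries of file 1
    { seen[$1] = 1 }                             # record entries of file 2
    END { for (c in base) if (!(c in seen)) print c }
  ' "$1" "$2" | LC_ALL=C sort
}

# Example (placeholder file names):
#   missing_chars chi_sim.lstm-unicharset my_small.unicharset
# The merge itself could then be tried with the merge_unicharsets training tool:
#   merge_unicharsets chi_sim.lstm-unicharset my_small.unicharset merged.unicharset
```

An empty result would suggest the smaller unicharset already covers everything in the larger one, so no merge is needed.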



Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Can you please test on arrows (↑ or ↓) instead of ±, if it's not 
inconvenient for you?

On Tuesday, June 18, 2019 at 2:21:18 PM UTC-4, shree wrote:
>
> I will test tomorrow and let you know
>
> On Tue, 18 Jun 2019, 23:47 Jingjing Lin wrote:
>
>> It still couldn't work after I increased the number of ± to about 100.
>> And the error rate after 2000 iterations is about 11. This is a pretty high
>> error rate compared to what we have for adding a few characters to eng. With
>> such a high error rate, I would not be surprised that it couldn't recognize
>> some special characters like ±. Is this it for chi_sim? Or can I increase
>> iterations to make the error rate smaller?
>> Thanks for your help.
>>
>> On Tuesday, June 18, 2019 at 10:32:37 AM UTC-4, shree wrote:
>>>
>>> Increase the number of ± to about 100.
>>>
>>> On Tue, Jun 18, 2019 at 7:39 PM Jingjing Lin wrote:
>>>
>>>> Sorry to bother you again and again.
>>>> I reduced the training text to about 450 lines, with about 30 ± in it. I
>>>> used two fonts and 1000 iterations. But it looks like ± is still not
>>>> picked up in the BEST OCR TEXT at all; it is always recognized as something
>>>> else. What is happening here? Should I increase the number of ±? Or do I
>>>> need to increase the number of fonts? I'm trying increasing iterations.
>>>>
>>>> On Tuesday, June 18, 2019 at 12:28:25 AM UTC-4, shree wrote:
>>>>>
>>>>> If you increase the iterations then the plus type of training will not
>>>>> give good results, i.e. the other letters will lose accuracy.
>>>>>
>>>>> You can try to reduce the training text size while still keeping all
>>>>> the characters that you need as part of the training text.
>>>>>
>>>>> On Tue, Jun 18, 2019 at 2:24 AM Jingjing Lin wrote:
>>>>>
>>>>>> I was only using two different fonts and it only achieved a lowest
>>>>>> error rate of 11.271 after the training. Does this mean I really need to
>>>>>> increase the iterations?
>>>>>>
>>>>>> On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>>>>>>>
>>>>>>> How big was your training text? How many iterations? Did the fonts
>>>>>>> you use for training support the plus minus sign?
>>>>>>>
>>>>>>> You can run training with --debug_interval -1 so that you can see
>>>>>>> whether the plus minus is being picked for training in the console
>>>>>>> messages.
>>>>>>>
>>>>>>> On Mon, 17 Jun 2019, 23:29 Jingjing Lin wrote:
>>>>>>>
>>>>>>>> Thanks. It works. The new character I added was there.
>>>>>>>>
>>>>>>>> Do you have any idea why, after fine tuning, tesseract still couldn't
>>>>>>>> recognize the new character I added? When I tried to add '±' to eng it
>>>>>>>> works, but when I tried to add '±' to chi_sim, it couldn't work
>>>>>>>> (explained below). Is there anything we need to pay attention to when
>>>>>>>> fine tuning langs other than eng?
>>>>>>>>
>>>>>>>> I used
>>>>>>>>
>>>>>>>> lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
>>>>>>>>   --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
>>>>>>>>   --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt 2>&1 |
>>>>>>>>   grep ±
>>>>>>>>
>>>>>>>> to check, and ± shows up only in Truth, not in OCR.
>>>>>>>>
>>>>>>>> On Monday, June 17, 2019 at 11:31:24 AM UTC-4, shree wrote:
>>>>>>>>>
>>>>>>>>> combine_tessdata -u new.traineddata new.
>>>>>>>>>
>>>>>>>>> will unpack the traineddata file. Check new.lstm-unicharset in it.
>>>>>>>>>
>>>>>>>>> On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote:
>>>>>>>>>>
>>>>>>>>>> I tried to fine tune the model and add a new character via
>>>>>>>>>> training, but it seems it still couldn't recognize this new
>>>>>>>>>> character using the new traineddata generated. To debug, I want to
>>>>>>>>>> check whether this new character is in the .unicharset in the new
>>>>>>>>>> traineddata generated. Is there a way to do this?
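
The check described in the oldest messages of this thread (unpack the traineddata, then look inside new.lstm-unicharset) can be scripted. The following is only a sketch: `unicharset_has_char` is a hypothetical helper, and it assumes the plain-text unicharset layout in which the first line is an entry count and each later line begins with the character.

```shell
# Hypothetical helper: succeed if CHAR has its own entry in an unpacked
# unicharset file. Assumes line 1 is the entry count and that the first
# space-separated field of each later line is the character.
unicharset_has_char() {
  awk -v c="$2" 'NR > 1 && $1 == c { found = 1 } END { exit !found }' "$1"
}

# Typical use after unpacking (file names as in the thread):
#   combine_tessdata -u new.traineddata new.
#   unicharset_has_char new.lstm-unicharset "±" && echo "± is present"
```

Note that a character being present in the unicharset only shows that the model can output it; as the thread shows, recognition can still fail.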

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Thanks a lot!


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Shree Devi Kumar
I will test tomorrow and let you know


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
It still couldn't work after I increased the number of ± to about 100. And 
the error rate after 2000 iterations is about 11. This is a pretty high 
error rate compared to what we have for adding a few characters to eng. With 
such a high error rate, I would not be surprised that it couldn't recognize 
some special characters like ±. Is this it for chi_sim? Or can I increase 
iterations to make the error rate smaller? 
Thanks for your help.


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Shree Devi Kumar
Increase the number of ± to about 100.
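
To see how many samples of a character the training text currently holds before adding more, a count along these lines may help. This is a sketch only; `count_char` is a hypothetical helper and the file name in the comment is a placeholder.

```shell
# Hypothetical helper: count how many times CHAR ($2) occurs in FILE ($1).
count_char() {
  # grep -o prints each match on its own line; wc -l then counts the matches.
  # "|| true" keeps the pipeline quiet when grep finds nothing.
  { grep -o -F -- "$2" "$1" || true; } | wc -l | tr -d ' '
}

# Example (placeholder file name):
#   count_char chi_sim.training_text "±"
```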

On Tue, Jun 18, 2019 at 7:39 PM Jingjing Lin  wrote:

> Sorry to bother you again and again.
> I reduced the training text to about 450 lines, with like 30 ± in it. I
> used two fonts and iteration of 1000. But it looks like ± is still not
> picked up by the BEST OCR TEXT at all, it always recognizes ± as something
> else. What is happening here? Should I increase the number of ±? Or do I
> need to increase the number of fonts? I'm trying increasing iterations.
>

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Sorry to bother you again and again.
I reduced the training text to about 450 lines, with about 30 ± in it. I 
used two fonts and 1000 iterations. But ± still never appears in the BEST 
OCR TEXT; it is always recognized as something else. What is happening 
here? Should I increase the number of ±? Or do I need to increase the 
number of fonts? I'm now trying more iterations.

On Tuesday, June 18, 2019 at 12:28:25 AM UTC-4, shree wrote:
>
> If you increase the iterations then the plus type of training will not 
> give good result, i.e. the other letters will lose accuracy.
>
> You can try to reduce the training text size while still keeping all the 
> characters that you need as part of the training text, 
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6d299e90-fc12-4a52-989f-5b787db5f1f7%40googlegroups.com.


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Thanks for your advice. I'll try reducing the training text size.

On Tuesday, June 18, 2019 at 12:28:25 AM UTC-4, shree wrote:
>
> If you increase the iterations then the plus type of training will not 
> give good result, i.e. the other letters will lose accuracy.
>
> You can try to reduce the training text size while still keeping all the 
> characters that you need as part of the training text, 
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1c8559e0-3160-43c9-89c2-93d3769697f1%40googlegroups.com.


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
If you increase the iterations then the plus type of training will not give
good results, i.e. the other letters will lose accuracy.

You can try to reduce the training text size while still keeping all the
characters that you need as part of the training text.
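
One possible way to shrink a training text while making sure the new character survives is sketched below. It assumes the training text is a plain UTF-8 file with one sample per line. `shrink_training_text` is a hypothetical helper, and it only guarantees that lines containing the given character are kept, so coverage of the other required characters still has to be checked separately.

```shell
# shrink_training_text CHAR MAX_OTHER INFILE
# Prints every line of INFILE that contains CHAR, followed by at most
# MAX_OTHER of the remaining lines (in their original order).
shrink_training_text() {
  # All lines containing the target character are kept.
  grep -a -F -- "$1" "$3"
  # Of the lines without it, only the first MAX_OTHER are kept.
  grep -a -v -F -- "$1" "$3" | head -n "$2"
}

# Hypothetical usage (placeholder file names):
#   shrink_training_text '±' 400 full.training_text > small.training_text
```

Note that the output lists the lines with the target character first, so the original line order is not preserved.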

On Tue, Jun 18, 2019 at 2:24 AM Jingjing Lin  wrote:

> I was only using two different fonts and It only achieved lowest error
> rate of 11.271 after the training, does this mean I really need to increase
> the iterations?
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVE5eVX9ZKRVqFb8RVyAY5ZcxVwTeosrk1-kA4CuitfeA%40mail.gmail.com.


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
Yes, each iteration is one line.

For eng, the langdata training text is about 80 lines and you add 15
symbols for plus minus. With 30 fonts, you will have about 2400 lines. So
in 3600 iterations, all samples will be seen and trained.

For chi_sim with larger training text it will be different.
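
The arithmetic above can be written out as a small helper. It is a rough sketch that assumes each text line rendered in each font yields exactly one training sample and that each iteration consumes one sample; `training_coverage` is a hypothetical name. The second call uses figures reported elsewhere in this thread (about 2200 lines, two fonts, 3600 iterations).

```shell
# training_coverage LINES FONTS ITERATIONS
# Prints the estimated sample count and how many passes ITERATIONS makes
# over those samples, under the one-sample-per-line-per-font assumption.
training_coverage() {
  samples=$(( $1 * $2 ))
  # Shell arithmetic is integer-only, so compute passes in hundredths.
  passes_x100=$(( $3 * 100 / samples ))
  printf 'samples=%d passes=%d.%02d\n' \
    "$samples" "$(( passes_x100 / 100 ))" "$(( passes_x100 % 100 ))"
}

training_coverage 80 30 3600    # the eng example: 80 lines, 30 fonts
training_coverage 2200 2 3600   # 2200 lines, 2 fonts, 3600 iterations
```

Under these assumptions the eng setup makes one and a half passes over its 2400 samples, while the 2200-line, two-font run does not complete a single pass over its 4400 samples.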

See https://github.com/Shreeshrii/tess4training for details of training
tutorial.





On Tue, 18 Jun 2019, 02:20 Jingjing Lin,  wrote:

> The training text was only about 2200 lines (200kB) and I used iteration
> of 3600. The fonts I used support ±.
>
> What do you mean by 'whether ± is being picked for training'? When I set
> --debug_interval -1 I found in every iteration it only outputs one line,
> does that mean in every iteration only one line is being used for training??
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXJ3KQKgFqxMPDmvEqCFZizE3fsv9b79F4H3GZUV1cBMg%40mail.gmail.com.


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
When I checked with --debug_interval -1 I found that although ± is in the 
GROUND TRUTH, it always showed up as + or something else, never as ±, in the 
BEST OCR TEXT. What can I do in this situation?
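
To see at a glance how often the sign shows up on each side, the console output can be saved and summarized. The sketch below assumes the log was captured with something like `lstmtraining ... --debug_interval -1 2>&1 | tee train.log`, and that the debug lines carry the labels GROUND TRUTH and BEST OCR TEXT as quoted above. The exact log layout and the helper name `pm_summary` are assumptions.

```shell
# pm_summary LOGFILE [CHAR]
# Counts how many GROUND TRUTH lines and how many BEST OCR TEXT lines
# in LOGFILE contain CHAR (default: ±).
pm_summary() {
  char=${2:-±}
  # "+" after the space tolerates one or more spaces between the label words.
  truth=$(grep -a -E 'GROUND +TRUTH' "$1" | grep -a -c -F -- "$char")
  ocr=$(grep -a -E 'BEST +OCR +TEXT' "$1" | grep -a -c -F -- "$char")
  echo "truth=$truth ocr=$ocr"
}

# Hypothetical usage:
#   pm_summary train.log
```

A nonzero truth count with an ocr count of zero would match the situation described here: the sign reaches training but is never produced by the model.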

On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>
> How big was your training text? How many iterations? Did the fonts you use 
> for training support the plus minus sign? 
>
> You can run training with --debug_interval -1 so that you can see whether 
> the plus minus is being picked for training in the console messages.
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f6d46170-15b7-4360-a6fb-027137dee640%40googlegroups.com.


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
I was only using two different fonts, and training only reached a lowest 
error rate of 11.271. Does this mean I really need to increase the 
iterations?

On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>
> How big was your training text? How many iterations? Did the fonts you use 
> for training support the plus minus sign? 
>
> You can run training with --debug_interval -1 so that you can see whether 
> the plus minus is being picked for training in the console messages.
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/692ad4d1-ff8e-4a67-a582-645a3fa5b941%40googlegroups.com.


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
The training text was only about 2200 lines (200kB) and I used 3600 
iterations. The fonts I used support ±. 

What do you mean by 'whether ± is being picked for training'? When I set 
--debug_interval -1 I found that every iteration outputs only one line. 
Does that mean only one line is used for training in each iteration?

On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>
> How big was your training text? How many iterations? Did the fonts you use 
> for training support the plus minus sign? 
>
> You can run training with --debug_interval -1 so that you can see whether 
> the plus minus is being picked for training in the console messages.
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f408c974-aa0b-4df9-a364-d1f0ca2a8a1c%40googlegroups.com.


Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
How big was your training text? How many iterations? Did the fonts you use
for training support the plus minus sign?

You can run training with --debug_interval -1 so that you can see in the
console messages whether the plus minus sign is being picked up for training.

On Mon, 17 Jun 2019, 23:29 Jingjing Lin,  wrote:

> Thanks. It works. The new character I added was there.
>
> Do you have any idea why after fine tuning tesseract still couldn't
> recognize the new character I added? When I tried to add '±' to eng it
> works, but when I tried to add '±' to chi_sim, it couldn't work (explained
> below). Is there anything we need to pay attention to when fine tuning
> other langs rather than eng?
>
> I used
>
> lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
>   --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
>   --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt \
>   2>&1 | grep ±
>
> to check and ± only shows up in Truth but not in OCR
>
>
> 在 2019年6月17日星期一 UTC-4上午11:31:24,shree写道:
>>
>> combine_tessdata -u new.traineddata new.
>>
>> will unpack the traineddata file. check new.lstm-unicharset in it
>>
>> On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote:
>>>
>>> I tried to fine tune the model and add a new character via training, but
>>> it seems it still couldn't recognize this new character using the new
>>> traineddata generated. To debug I want to check whether this new character
>>> is in the .unicharset in the new traineddata generated. Is there a way to
>>> do this?
>>>



[tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
Thanks. It works. The new character I added was there.

Do you have any idea why, after fine-tuning, tesseract still couldn't 
recognize the new character I added? When I tried to add '±' to eng it 
worked, but when I tried to add '±' to chi_sim it did not (explained 
below). Is there anything we need to pay attention to when fine-tuning 
languages other than eng?

I used 

lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
  --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt 2>&1 |
  grep ±

to check, and ± shows up only in the Truth lines, not in the OCR lines.
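
To put numbers on that comparison, here is a minimal sketch that counts matching lines in a saved copy of the lstmeval output. The function name count_lines_with and the file name eval.log are hypothetical, and it assumes the output marks ground-truth lines with a leading "Truth" and recognized lines with a leading "OCR".

```shell
# Hypothetical helper: count the lines of FILE that start with PREFIX
# and also contain CHAR.
# Usage: count_lines_with PREFIX CHAR FILE
count_lines_with() {
  awk -v prefix="$1" -v ch="$2" \
    'index($0, prefix) == 1 && index($0, ch) > 0 { n++ } END { print n + 0 }' "$3"
}

# Example, after saving the evaluation output first:
#   lstmeval --model ... --traineddata ... --eval_listfile ... > eval.log 2>&1
#   count_lines_with Truth '±' eval.log
#   count_lines_with OCR   '±' eval.log
```

A non-zero Truth count together with a zero OCR count matches the symptom described here: the evaluation data contains the character but the model never outputs it.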


On Monday, June 17, 2019 at 11:31:24 AM UTC-4, shree wrote:
>
> combine_tessdata -u new.traineddata new.
>
> will unpack the traineddata file. Then check new.lstm-unicharset among the
> unpacked files.
>
> On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote:
>>
>> I tried to fine tune the model and add a new character via training, but 
>> it seems it still couldn't recognize this new character using the new 
>> traineddata generated. To debug I want to check whether this new character 
>> is in the .unicharset in the new traineddata generated. Is there a way to 
>> do this?
>>
>



[tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread shree
combine_tessdata -u new.traineddata new.

will unpack the traineddata file. Then check new.lstm-unicharset among the 
unpacked files.
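
A minimal sketch of that check, assuming the usual unicharset text layout: the first line is the entry count, and every following line starts with the character itself followed by a space and its properties. The function name unicharset_has_char is hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) if CHAR is listed as an
# entry in a unicharset text file, fail (exit status 1) otherwise.
# Usage: unicharset_has_char FILE CHAR
unicharset_has_char() {
  # Skip line 1 (the entry count) and compare the first field of each
  # remaining line with the character.
  awk -v c="$2" 'NR > 1 && $1 == c { found = 1 } END { exit found ? 0 : 1 }' "$1"
}

# Example, after running: combine_tessdata -u new.traineddata new.
#   if unicharset_has_char new.lstm-unicharset '±'; then
#     echo "character is in the unicharset"
#   else
#     echo "character is missing"
#   fi
```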

On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote:
>
> I tried to fine tune the model and add a new character via training, but 
> it seems it still couldn't recognize this new character using the new 
> traineddata generated. To debug I want to check whether this new character 
> is in the .unicharset in the new traineddata generated. Is there a way to 
> do this?
>
