Old thread https://groups.google.com/forum/#!searchin/tesseract-ocr/layer$20chi_sim%7Csort:date/tesseract-ocr/iFMg7Gjczq4/f7_XRop2BAAJ
On Wed, Jun 19, 2019 at 9:13 PM Shree Devi Kumar <shreesh...@gmail.com> wrote:

> Update:
>
> 1. When using a smaller training_text for chi_sim for plus training, the
> unicharset gets restricted. So, merge the lstm-unicharset with it.
>
> 2. The unicharset for chi_sim using langdata is different from the one
> extracted from tessdata_best, so using training_text from langdata will
> add more characters.
>
> 3. The fonts used for LSTM training are given in langdata_lstm in
> okfonts.txt. For plus training the same fonts should be used, otherwise
> it will require training of new typefaces.
>
> 4. Another user was trying to fine-tune chi_sim (check old forum posts) to
> add the theta sign. If I remember correctly, the plus type training did
> not work for it. Replace top layer was the better option.
>
> 5. I am training with the following fonts.
> "Adobe Heiti Std" \
> "Adobe Kaiti Std" \
> "Arial Unicode MS" \
> "Bitstream CyberCJK" \
> "Microsoft YaHei UI" \
> "Microsoft YaHei" \
> "NSimSun" \
> "Noto Sans CJK SC" \
> "Noto Sans Mono CJK SC" \
> "STXihei" \
> "SimSun" \
> "WenQuanYi Zen Hei Medium" \
> "WenQuanYi Zen Hei Mono Medium" \
> "WenQuanYi Zen Hei Sharp Medium" \
>
> At iteration 1046/1100/1100, Mean rms=0.704%, delta=1.445%, char
> train=4.888%, word train=46.842%, skip ratio=0%, New best char error =
> 4.888 wrote best
> model:/home/ubuntu/tesstutorial/chi_sim_plus/chi_sim_plus4.888_1046.checkpoint
> wrote checkpoint.
>
> On Wed, Jun 19, 2019 at 12:36 AM Jingjing Lin <joejoeu...@gmail.com> wrote:
>
>> Can you please test on arrows (↑ or ↓) instead of ±, if it's not
>> inconvenient for you?
>>
>> On Tuesday, June 18, 2019 at 2:21:18 PM UTC-4, shree wrote:
>>>
>>> I will test tomorrow and let you know.
>>>
>>> On Tue, 18 Jun 2019, 23:47 Jingjing Lin, <joejo...@gmail.com> wrote:
>>>
>>>> It still didn't work after I increased the number of ± to about 100.
>>>> And the error rate after 2000 iterations is about 11. This is a pretty
>>>> high error rate compared to what we get when adding a few characters
>>>> to eng. With such a high error rate, I would not be surprised that it
>>>> couldn't recognize some special characters like ±. Is this it for
>>>> chi_sim? Or can I increase the iterations to make the error rate smaller?
>>>> Thanks for your help.
>>>>
>>>> On Tuesday, June 18, 2019 at 10:32:37 AM UTC-4, shree wrote:
>>>>>
>>>>> Increase the number of ± to about 100.
>>>>>
>>>>> On Tue, Jun 18, 2019 at 7:39 PM Jingjing Lin <joejo...@gmail.com> wrote:
>>>>>
>>>>>> Sorry to bother you again and again.
>>>>>> I reduced the training text to about 450 lines, with about 30 ± in it.
>>>>>> I used two fonts and 1000 iterations. But it looks like ± is still not
>>>>>> picked up in the BEST OCR TEXT at all; it is always recognized as
>>>>>> something else. What is happening here? Should I increase the number
>>>>>> of ±? Or do I need to increase the number of fonts? I'm trying
>>>>>> increasing the iterations.
>>>>>>
>>>>>> On Tuesday, June 18, 2019 at 12:28:25 AM UTC-4, shree wrote:
>>>>>>>
>>>>>>> If you increase the iterations then the plus type of training will
>>>>>>> not give a good result, i.e. the other letters will lose accuracy.
>>>>>>>
>>>>>>> You can try to reduce the training text size while still keeping all
>>>>>>> the characters that you need as part of the training text.
>>>>>>>
>>>>>>> On Tue, Jun 18, 2019 at 2:24 AM Jingjing Lin <joejo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I was only using two different fonts, and it only achieved a lowest
>>>>>>>> error rate of 11.271 after the training. Does this mean I really
>>>>>>>> need to increase the iterations?
>>>>>>>>
>>>>>>>> On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>>>>>>>>>
>>>>>>>>> How big was your training text? How many iterations? Did the fonts
>>>>>>>>> you used for training support the plus-minus sign?
>>>>>>>>>
>>>>>>>>> You can run training with --debug_interval -1 so that you can see
>>>>>>>>> in the console messages whether the plus-minus sign is being picked
>>>>>>>>> up for training.
>>>>>>>>>
>>>>>>>>> On Mon, 17 Jun 2019, 23:29 Jingjing Lin, <joejo...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks. It works. The new character I added was there.
>>>>>>>>>>
>>>>>>>>>> Do you have any idea why, after fine tuning, tesseract still
>>>>>>>>>> couldn't recognize the new character I added? When I tried to add
>>>>>>>>>> '±' to eng it worked, but when I tried to add '±' to chi_sim it
>>>>>>>>>> didn't (explained below). Is there anything we need to pay
>>>>>>>>>> attention to when fine tuning languages other than eng?
>>>>>>>>>>
>>>>>>>>>> I used
>>>>>>>>>>
>>>>>>>>>> lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
>>>>>>>>>>   --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
>>>>>>>>>>   --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt 2>&1 |
>>>>>>>>>>   grep ±
>>>>>>>>>>
>>>>>>>>>> to check, and ± only shows up in Truth but not in OCR.
>>>>>>>>>>
>>>>>>>>>> On Monday, June 17, 2019 at 11:31:24 AM UTC-4, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>> combine_tessdata -u new.traineddata new.
>>>>>>>>>>>
>>>>>>>>>>> will unpack the traineddata file. Check new.lstm-unicharset in it.
>>>>>>>>>>>
>>>>>>>>>>> On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I tried to fine tune the model and add a new character via
>>>>>>>>>>>> training, but it seems the newly generated traineddata still
>>>>>>>>>>>> can't recognize this new character. To debug, I want to check
>>>>>>>>>>>> whether the new character is in the .unicharset of the newly
>>>>>>>>>>>> generated traineddata. Is there a way to do this?
-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduW0GXmO1Ro5NQ_yyWUMSZzwGffpW4oayMAF-bkeecmfLA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
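
Two checks come up repeatedly in this thread: whether the new character actually ended up in the unpacked lstm-unicharset, and how many times it occurs in the training text. A minimal shell sketch of both follows. The helper names `has_unichar` and `count_char` are hypothetical and not part of Tesseract, and the file names in the usage comments are only examples. The sketch assumes the usual unicharset layout, where the first line holds the entry count and every later line starts with the character followed by space-separated properties.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Tesseract) for the checks discussed above.

# has_unichar FILE CHAR
# Succeeds (exit status 0) if CHAR is the first field of an entry line in FILE.
# Assumption: line 1 of a unicharset file is the entry count, and each later
# line begins with the character, followed by space-separated properties.
has_unichar() {
  UNICHAR="$2" awk '
    # Appending "" forces a string comparison, so digits are matched literally.
    NR > 1 && ($1 "") == (ENVIRON["UNICHAR"] "") { found = 1 }
    END { exit !found }
  ' "$1"
}

# count_char FILE CHAR
# Prints how many times CHAR occurs in FILE (for example a training text).
count_char() {
  grep -oF -- "$2" "$1" | wc -l | tr -d ' '
}

# Example usage (paths are illustrative):
#   combine_tessdata -u new.traineddata new.
#   has_unichar new.lstm-unicharset '±' && echo present || echo missing
#   count_char chi_sim.training_text '±'
```

If `has_unichar` reports the character as present but lstmeval still shows it only in Truth, the unicharset is not the problem, and the number of samples, the fonts, or the training type are the next things to look at, as discussed above.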