Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
If you increase the iterations, the plus type of training will not give good results, i.e. the other letters will lose accuracy. You can try reducing the training text size while still keeping all the characters that you need as part of the training text. On Tue, Jun 18, 2019 at 2:24 AM

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
Yes, each iteration is one line. For eng, the langdata training text is about 80 lines and you add 15 symbols for plus minus. With 30 fonts, you will have about 2400 lines. So in 3600 iterations, all samples will be seen and trained. For chi_sim with larger training text it will be different.
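The arithmetic in this reply can be sketched as a short calculation. This is only an illustration of the reasoning above, assuming (as the reply does) that every line of training text is rendered once per font and that one iteration consumes one rendered line. The function names are made up for this sketch and are not part of Tesseract.

```python
def rendered_lines(text_lines: int, fonts: int) -> int:
    """Number of training samples if each text line is rendered once per font."""
    return text_lines * fonts


def passes_over_data(iterations: int, total_lines: int) -> float:
    """How many times the trainer sees the whole set, at one line per iteration."""
    return iterations / total_lines


# Figures from the reply: about 80 lines of eng training text, 30 fonts.
eng_lines = rendered_lines(80, 30)
eng_passes = passes_over_data(3600, eng_lines)
print(f"eng: {eng_lines} lines, {eng_passes:.2f} passes in 3600 iterations")

# Figures reported elsewhere in this thread: about 2200 lines, two fonts.
big_lines = rendered_lines(2200, 2)
big_passes = passes_over_data(3600, big_lines)
print(f"larger text: {big_lines} lines, {big_passes:.2f} passes in 3600 iterations")
```

Under these assumptions, a value below 1.0 means some rendered lines are never seen during training, which is one possible reason a rarely occurring character such as ± is not learned.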

Re: [tesseract-ocr] Trained data for E13B font

2019-06-17 Thread ElGato ElMago
I guess the content of the training text is important when you add new characters. I had the same issue at first, and then Shree suggested a larger text and more iterations. I thought variation in the text would matter as well. I'm getting good results after I prepared good training text. Now,

[tesseract-ocr] Re: Training on cloud

2019-06-17 Thread ElGato ElMago
Raspberry Pi 3B is enough for me. It takes 1 to 2 days depending on the training. On Tuesday, June 18, 2019 at 7:50:04 UTC+9, Mox Betex wrote: > > I was thinking of paying for Dedicated Server on >> https://www.germanvps.com/hg-linux-kvm-hosting.php to train data. >> > > Can someone tell me is this server enough

[tesseract-ocr] Re: Training on cloud

2019-06-17 Thread Mox Betex
I was thinking of paying for a dedicated server on https://www.germanvps.com/hg-linux-kvm-hosting.php to train data. Can someone tell me whether this server is enough to train data fast? How long would training take with this specification? - 8 Core Intel Xeon 2.60GHz, 32GB DDR4

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
When I checked with --debug_interval -1, I found that although ± is in the GROUND TRUTH, it always showed up as + or something else, never as ±, in the BEST OCR TEXT. What can I do in this situation? On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote: > > How big was your training text? How many iterations? Did

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
I was using only two different fonts, and the lowest error rate achieved after training was 11.271. Does this mean I really need to increase the iterations? On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote: > > How big was your training text? How many iterations? Did the fonts you use > for

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
The training text was only about 2200 lines (200kB) and I used 3600 iterations. The fonts I used support ±. What do you mean by 'whether ± is being picked for training'? When I set --debug_interval -1, I found that every iteration outputs only one line. Does that mean in every iteration

[tesseract-ocr] Re: FontAwesome and Tesseract

2019-06-17 Thread Jason
Can I "bump" this? Even if I only get a high-level description of the process? - How to make a box file (for v4) of unicode chars - How to make the training size invariant? Etc. Many thanks! On Tuesday, May 21, 2019 at 10:09:57 AM UTC-4, Jason wrote: > > I would like to be able to detect

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
How big was your training text? How many iterations? Did the fonts you use for training support the plus minus sign? You can run training with --debug_interval -1 so that you can see in the console messages whether the plus minus is being picked for training. On Mon, 17 Jun 2019, 23:29 Jingjing

[tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
Thanks. It works. The new character I added was there. Do you have any idea why, after fine tuning, tesseract still couldn't recognize the new character I added? When I tried to add '±' to eng it worked, but when I tried to add '±' to chi_sim, it didn't work (explained below). Is there anything

Re: [tesseract-ocr] Re: lstmeval shows good result but visualized result looks bad

2019-06-17 Thread Shree Devi Kumar
I don't think you need training to improve results. You need to pre-process the image, straighten it. Use a separate tool to identify each cell of data and then OCR that. You will get best results like that. On Mon, Jun 17, 2019 at 6:07 PM phucp...@gmail.com wrote: > Thanks shree for your

[tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread shree
combine_tessdata -u new.traineddata new. (including the trailing dot) will unpack the traineddata file. Check new.lstm-unicharset in the unpacked files. On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote: > > I tried to fine tune the model and add a new character via training, but > it seems it still couldn't recognize
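The check described in this reply could be scripted roughly as follows. This is a minimal sketch, not an official tool: it assumes combine_tessdata is on the PATH, and it assumes the unpacked unicharset is a UTF-8 text file whose first line is the entry count and whose remaining lines each start with the character followed by a space. The helper names unpack_traineddata and unicharset_contains are hypothetical.

```python
import subprocess
from pathlib import Path


def unpack_traineddata(traineddata: str, prefix: str) -> None:
    """Run `combine_tessdata -u <traineddata> <prefix>` to unpack the components.

    Requires the Tesseract training tools to be installed and on PATH.
    """
    subprocess.run(["combine_tessdata", "-u", traineddata, prefix], check=True)


def unicharset_contains(unicharset_path: str, char: str) -> bool:
    """Return True if `char` is the first field of any entry in the unicharset."""
    lines = Path(unicharset_path).read_text(encoding="utf-8").splitlines()
    # Skip the first line (assumed to be the entry count) and compare the first
    # space-separated field of every remaining non-empty line.
    return any(line.split(" ", 1)[0] == char for line in lines[1:] if line)


# Example usage with the file names from the reply:
#   unpack_traineddata("new.traineddata", "new.")
#   print(unicharset_contains("new.lstm-unicharset", "±"))
```

If the character is present in the unicharset but still never recognized, the problem is more likely in the training data or the number of iterations than in the character set itself.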

[tesseract-ocr] how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
I tried to fine tune the model and add a new character via training, but it seems it still couldn't recognize this new character using the new traineddata generated. To debug I want to check whether this new character is in the .unicharset in the new traineddata generated. Is there a way to do

Re: [tesseract-ocr] Re: lstmeval shows good result but visualized result looks bad

2019-06-17 Thread phucp...@gmail.com
Thanks shree for your reply. I see that you are very busy answering a lot of questions here. Thanks again for taking some time for me. > > Your files have prefix of jpn, so I assume you are training for Japanese, > but the image in question has only numbers in it. > Well I forgot to mention, my

[tesseract-ocr] Extract words from images in a image form

2019-06-17 Thread Mox Betex
Can Tesseract (or any other software) extract words or lines from images in image form, not text form? I have a lot of scanned images, and for training data I need to extract words and lines from those images in order to make tiff/txt files for training. Is there a way to do that with some

[tesseract-ocr] Re: lstmeval shows good result but visualized result looks bad

2019-06-17 Thread shree
Your files have a prefix of jpn, so I assume you are training for Japanese, but the image in question has only numbers in it. Getting good results on eval data but bad results on OCR could be the result of overfitting the model, if you have used a small sample and trained for a large number of