Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

Shree Devi Kumar Mon, 17 Jun 2019 21:25:30 -0700

Yes, each iteration is one line.

For eng, the langdata training text is about 80 lines, and you add 15
symbols for plus minus. With 30 fonts, you will have about 2400 line
images (80 x 30). So in 3600 iterations, all samples will be seen and
trained on.
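
A rough sketch of that arithmetic, assuming one line image per text line per
font and one line image consumed per iteration, as described above. The
function names are made up for illustration and are not part of any
tesseract tool.

```shell
# Estimate how many line images one full pass over the training data holds.
# Assumes one line image per (text line, font) pair; real counts can differ
# if some lines fail to render in some fonts.
samples_per_pass() {
    # $1 = lines in the training text, $2 = number of fonts
    echo $(( $1 * $2 ))
}

# Integer percentage of one full pass covered by a given iteration count.
coverage_percent() {
    # $1 = iterations, $2 = samples per pass
    echo $(( $1 * 100 / $2 ))
}

samples=$(samples_per_pass 80 30)
echo "samples per pass: $samples"
echo "coverage at 3600 iterations: $(coverage_percent 3600 "$samples")%"
```

A coverage figure at or above 100 suggests every sample can be seen at least
once. With a much larger training text, the same iteration count may cover
only a fraction of the samples.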


For chi_sim with larger training text it will be different.

See https://github.com/Shreeshrii/tess4training for details of training
tutorial.





On Tue, 18 Jun 2019, 02:20 Jingjing Lin, <joejoeu...@gmail.com> wrote:

> The training text was only about 2200 lines (200kB) and I ran 3600
> iterations. The fonts I used support ±.
>
> What do you mean by 'whether ± is being picked for training'? When I set
> --debug_interval -1 I found in every iteration it only outputs one line,
> does that mean in every iteration only one line is being used for training??
>
> On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>>
>> How big was your training text? How many iterations? Did the fonts you
>> use for training support the plus minus sign?
>>
>> You can run training with --debug_interval -1 so that you can see
>> whether the plus minus is being picked for training in the console messages.
>>
>> On Mon, 17 Jun 2019, 23:29 Jingjing Lin, <joejo...@gmail.com> wrote:
>>
>>> Thanks. It works. The new character I added was there.
>>>
>>> Do you have any idea why, after fine tuning, tesseract still couldn't
>>> recognize the new character I added? When I tried to add '±' to eng it
>>> worked, but when I tried to add '±' to chi_sim, it didn't work (explained
>>> below). Is there anything we need to pay attention to when fine tuning
>>> langs other than eng?
>>>
>>> I used
>>>
>>> lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
>>>   --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
>>>   --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt \
>>>   2>&1 | grep ±
>>>
>>> to check, and ± only shows up in the Truth lines, not in the OCR lines.
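
One way to put numbers on that observation is to save the lstmeval output to
a file and count the matching lines per label. This is only a sketch: it
assumes ground-truth lines start with "Truth" and recognized lines start with
"OCR", as the description above suggests, and the function name is made up
for illustration.

```shell
# Count the lines in a saved log that start with a given label and also
# contain a given character.
# Assumption: ground-truth lines start with "Truth" and recognized lines
# start with "OCR"; adjust the labels if the log looks different.
count_lines_with_char() {
    # $1 = log file, $2 = line label, $3 = character to look for
    awk -v label="$2" -v c="$3" '
        index($0, label) == 1 && index($0, c) > 0 { n++ }
        END { print n + 0 }
    ' "$1"
}
```

With the output of the lstmeval command above saved to a file, calling the
function once with the label Truth and once with the label OCR gives the two
counts. A non-zero Truth count next to a zero OCR count would match what is
described here: the character is in the evaluation data but is never
produced by the model.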
>>>
>>>
>>> On Monday, June 17, 2019 at 11:31:24 AM UTC-4, shree wrote:
>>>>
>>>> combine_tessdata -u new.traineddata new.
>>>>
>>>> will unpack the traineddata file. Check new.lstm-unicharset among the
>>>> unpacked files.
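
Once the file is unpacked, the check itself can be scripted. The sketch below
assumes the usual text layout of a unicharset file, where the first line is
the entry count and every later line starts with the character followed by
its properties. The function name is hypothetical.

```shell
# Succeed (exit status 0) if the character is an entry in an unpacked
# unicharset file, fail otherwise.
# Assumption: line 1 holds the entry count, and each later line begins with
# the character itself followed by whitespace and its properties.
unicharset_has_char() {
    # $1 = path to the unicharset file, $2 = character to look for
    awk -v c="$2" 'NR > 1 && $1 == c { found = 1 } END { exit !found }' "$1"
}
```

After running the combine_tessdata command above, something like
unicharset_has_char new.lstm-unicharset '±' && echo present
should print "present" if the character made it into the model's character
set.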
>>>>
>>>> On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote:
>>>>>
>>>>> I tried to fine tune the model and add a new character via training,
>>>>> but it seems it still couldn't recognize this new character using the new
>>>>> traineddata generated. To debug I want to check whether this new character
>>>>> is in the .unicharset in the new traineddata generated. Is there a way to
>>>>> do this?
>>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesser...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/d251e677-5f9d-4f8f-b41a-aa015538ca47%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/d251e677-5f9d-4f8f-b41a-aa015538ca47%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>

