Re: [tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

Shree Devi Kumar Fri, 10 Apr 2020 18:36:26 -0700

Please see
https://tesseract-ocr.github.io/tessdoc/Data-Files-in-tessdata_fast


It seems that Ray used a smaller network spec for many languages when
training for tessdata_fast, to speed them up. However, since the float
versions of those models are not available, training has to start from the
tessdata_best models, which use the larger network. That might explain the
result you got.
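
Converting a best model to integer keeps the same layers, so one way to
confirm where the time goes is to time both models on the same image. A
minimal sketch in Python, assuming the `tesseract` binary is on PATH and the
image is a single cropped text line (hence `--psm 7`); the helper names, the
image path and the directory names are hypothetical:

```python
import subprocess
import time
from typing import Callable, Dict, List


def tesseract_cmd(image: str, tessdata_dir: str, lang: str = "Latin") -> List[str]:
    """Build a tesseract command line that prints the result to stdout."""
    return ["tesseract", image, "stdout",
            "--tessdata-dir", tessdata_dir,
            "-l", lang,
            "--psm", "7"]  # page segmentation mode 7: a single text line


def mean_seconds(run: Callable[[], object], repeats: int = 5) -> float:
    """Call run() `repeats` times and return the mean wall-clock time."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        run()
    return (time.perf_counter() - start) / repeats


def compare_models(image: str, tessdata_dirs: Dict[str, str]) -> Dict[str, float]:
    """Time the same image against several tessdata directories.

    Each directory is assumed to hold its own Latin.traineddata.
    """
    timings = {}
    for label, directory in tessdata_dirs.items():
        cmd = tesseract_cmd(image, directory)
        timings[label] = mean_seconds(
            lambda: subprocess.run(cmd, capture_output=True, check=True))
    return timings
```

For example, `compare_models("name.png", {"fast": "./tessdata_fast/script",
"fine-tuned int": "./output"})` would return the mean seconds per run for
each model. The timings include process start-up and model loading, so they
are only a rough comparison.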

Fine-tuning for impact does not change the network structure. Fine-tuning
for plus-minus a few characters, or replacing the top layer, may do that.
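
The "replace top layer" route can be sketched as two commands: extract the
LSTM from the best model, then continue training with the top layers cut off
and a narrower one appended. `combine_tessdata -e` and the `lstmtraining`
flags used here are from the Tesseract training documentation. The helper
names, the file names, the index 5, the width 192 and the class count 111 are
illustrative assumptions and would have to match the actual Latin network
and the new unicharset:

```python
from typing import List


def extract_lstm_cmd(traineddata: str, lstm_out: str) -> List[str]:
    """Command that extracts the LSTM component from a traineddata file."""
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def replace_top_layer_cmd(old_lstm: str,
                          new_traineddata: str,
                          train_listfile: str,
                          model_output: str,
                          append_index: int = 5,      # assumption: last layer to keep
                          hidden_size: int = 192,     # assumption: narrower top LSTM
                          num_classes: int = 111,     # assumption: placeholder class count
                          max_iterations: int = 3000) -> List[str]:
    """Command that keeps layers up to append_index and appends a new top."""
    net_spec = f"[Lfx{hidden_size} O1c{num_classes}]"
    return ["lstmtraining",
            "--continue_from", old_lstm,
            "--traineddata", new_traineddata,
            "--append_index", str(append_index),
            "--net_spec", net_spec,
            "--model_output", model_output,
            "--train_listfile", train_listfile,
            "--max_iterations", str(max_iterations)]
```

Passing each returned list to `subprocess.run(..., check=True)` would run the
step. Whether a narrower top layer keeps the accuracy on diacritics is
something only a test on your own documents can show.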


On Fri, Apr 10, 2020, 19:54 O CR <[email protected]> wrote:

> Thank you for responding.
> I did the fine-tuning on the best Latin float model and converted the
> result to integer. But it is still slower than the fast integer Latin
> model.
> Any other ideas to make it faster?
>
> On Friday, 10 April 2020 14:17:55 UTC+2, shree wrote:
>>
>> The file is probably there as script/Latin.traineddata.
>> You can copy it to wherever you look for the best traineddata files.
>>
>> On Fri, Apr 10, 2020, 16:59 O CR <[email protected]> wrote:
>>
>>> Which language do I have to use? Because Latin isn't supported.
>>> ./tesstrain.sh --fonts_dir "/usr/share/fonts" *--lang Latin*
>>> --linedata_only  --noextract_font_properties --langdata_dir ./langdata
>>> --tessdata_dir ./tessdata  --output_dir ./output
>>>
>>> On Wednesday, 8 April 2020 18:27:15 UTC+2, shree wrote:
>>>>
>>>> I suggest you fine-tune Latin.traineddata using text of the kind you
>>>> expect. It will have a smaller unicharset, and when you convert it to
>>>> a fast integer model, it should be smaller in size.
>>>>
>>>> On Wed, Apr 8, 2020, 20:39 O CR <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am trying to read names in images with Tesseract LSTM. Names like:
>>>>>
>>>>> Śerena Kovitch
>>>>>
>>>>> ŁAGUNA EVREIST
>>>>>
>>>>> Äna Optici
>>>>>
>>>>> Orğu Moninck
>>>>>
>>>>>
>>>>> (I don't have to recognize words)
>>>>>
>>>>>
>>>>> Latin.traineddata (fast integer) does well with the diacritics, but
>>>>> it contains a lot of characters I don't need, such as digits, %, ﹕,
>>>>> ﹖, ﹗, ﹙, ﹚, ﹛, ﹜, ﹝, ﹞, ﹟, ﹠, ﹡, ﹢, ﹣, ﹤, ﹥, ﹦, ﹨, ﹩, ﹪, ﹫ and
>>>>> many more. And so Latin.traineddata is too slow.
>>>>>
>>>>> So I thought I would take eng.traineddata (best float for LSTM) and
>>>>> train it for the diacritics. But there are almost 400 diacritics, so
>>>>> I don't know whether fine-tuning for that many characters is a good
>>>>> idea.
>>>>>
>>>>> I tried it anyway, but the quality is very poor.
>>>>>
>>>>> I trained with eng.training_text (an English text of 72 lines), to
>>>>> which I added all the diacritics several times. The character error
>>>>> rate during lstmeval is around 0.1. In a test with 80 documents (one
>>>>> name per document), 30 names were read correctly. The time is similar
>>>>> to Latin.traineddata.
>>>>>
>>>>>
>>>>> What can I do to get a model that is as good as Latin.traineddata on
>>>>> diacritics but much faster at OCR?
>>>>>
>>>>>
>>>>> Thank you.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>

