Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for updates.

On Saturday, January 9, 2021 at 10:19:02 PM UTC+5:30 qntmm...@gmail.com 
wrote:

>
> How do I create my own custom unicharset file? The tesstrain script seems 
> to be generating one based on the training text but I want to pass in my 
> own unicharset file. 
> On Friday, January 8, 2021 at 12:58:27 AM UTC-6 shree wrote:
>
>> Are any of these vertical fonts?
>>
>> Encoding errors can occur when characters in the training text are 
>> missing from the unicharset.
>>
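One way to test the unicharset explanation above is to compare the training text against the unicharset directly. The following is a minimal sketch, not part of tesstrain: the function names are hypothetical, and it assumes the usual unicharset layout (an entry count on the first line, then one entry per line beginning with the unichar, with the space character stored as the placeholder NULL). It only checks whether each character appears in some entry, so it is a rough approximation that ignores any text normalization Tesseract applies before encoding.

```python
from pathlib import Path


def read_unicharset_chars(path):
    """Return the set of unichars listed in a Tesseract unicharset file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    chars = set()
    # The first line holds the entry count; each later line starts with the unichar.
    for line in lines[1:]:
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        # Assumption: the space character is stored under the placeholder "NULL".
        chars.add(" " if token == "NULL" else token)
    return chars


def uncovered_chars(text, unichars):
    """Return the sorted characters of `text` that appear in no unicharset entry."""
    # Entries may span several code points, so flatten them into single characters.
    covered = set("".join(unichars)) | {"\n", "\r"}
    return sorted(set(text) - covered)
```

To use it, pass the path of the unicharset to `read_unicharset_chars`, read the training text as UTF-8, and print the result of `uncovered_chars`. Any characters it reports are candidates for the "Encoding of string failed!" messages.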
>> On Fri, Jan 8, 2021, 00:46 Kamui 7 <qntmm...@gmail.com> wrote:
>>
>>> Looks like that fixed bug #1. Now it is able to successfully create 400 
>>> pages. Do you have any ideas as to why the other 2 errors are occurring?
>>> On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote:
>>>
>>>> Your training text file is only 175 lines, so the rendered image fits 
>>>> in 4 pages. You need to use a larger text if you want more pages.
>>>>
>>>> Also check that your fonts support both English and Japanese as the 
>>>> text seems to have samples of both languages.
>>>>
>>>> On Thu, Jan 7, 2021, 22:40 Kamui 7 <qntmm...@gmail.com> wrote:
>>>>
>>>>> I ran a find command from the root directory to search for the 
>>>>> tesstrain script. The only copy it found is the one I pulled from the 
>>>>> latest tesseract git repo. My training script calls that specific 
>>>>> tesstrain script using a relative path, so it can't be an older version.
>>>>>
>>>>> On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote:
>>>>>
>>>>>> Old versions of tesstrain.sh used to limit training to 3 pages. Looks 
>>>>>> like you may have an old version in the path somewhere.
>>>>>>
>>>>>> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>>
>>>>>>> I have a script to train tesseract. I ran it on Arch Linux, Debian, 
>>>>>>> and even in a Docker container, and they all produce the same 
>>>>>>> errors. I checked to make sure the script is correct as well. 
>>>>>>>
>>>>>>> Bug 1:
>>>>>>> This happens when tesstrain runs text2image. The max pages parameter 
>>>>>>> has no effect: only 4 pages are rendered regardless of the value I 
>>>>>>> pass for the maxpages parameter. I even tried hardcoding it in the 
>>>>>>> tesstrain_utils.sh file, and it still does the same thing. 
>>>>>>>
>>>>>>> Bug 2:
>>>>>>> After it finishes producing those 4 pages, I fine-tune with 
>>>>>>> lstmtraining, and the resulting output is full of "Encoding of string 
>>>>>>> failed!" errors.
>>>>>>>
>>>>>>> Bug 3:
>>>>>>> Along with those encoding errors, it also outputs the following text:
>>>>>>>
>>>>>>> "Image too small to scale!! (2x48 vs min width of 3)
>>>>>>> Line cannot be recognized!!
>>>>>>> Image not trainable"
>>>>>>>
>>>>>>> I will upload my script along with the Dockerfile if anyone wants to 
>>>>>>> take a look. 
>>>>>>>
>>>>>>>
>>>>>>> https://drive.google.com/file/d/1FkW1q1cXwOxY6Yi1A1cMzInbtJa9L01M/view?usp=sharing
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>>
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/58d05cc4-9ece-44cc-a3ad-2938c2a716d6n%40googlegroups.com.
