The unicharset is extracted from the training text, because those are the
samples that will be used for training.

Why do you want to use a different unicharset?
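
If the goal is to track down the "Encoding of string failed!" errors, one
rough way to test the missing-character theory is to compare the training
text against the unicharset. The following is only a minimal sketch, not
part of tesstrain: the helper names load_unicharset and missing_characters
are hypothetical, and it assumes the usual unicharset layout (a count on the
first line, then one entry per line whose first space-separated field is the
unichar, with the space character written as NULL). Tesseract also normalizes
text before encoding, so treat the result as a hint rather than a definitive
answer.

```python
def load_unicharset(lines):
    """Collect the unichar strings from the lines of a unicharset file.

    Assumption: the first line is the entry count and every other line
    starts with the unichar, followed by space-separated properties.
    """
    it = iter(lines)
    next(it, None)  # skip the count line
    unichars = set()
    for line in it:
        stripped = line.rstrip("\r\n")
        if not stripped:
            continue
        token = stripped.split(" ", 1)[0]
        # Assumption: the space character is stored as the literal NULL.
        unichars.add(" " if token == "NULL" else token)
    return unichars


def missing_characters(text, unichars):
    """Return the sorted non-whitespace characters of text that appear in no entry."""
    # Entries can span several code points, so compare per code point.
    known = set("".join(unichars))
    return sorted({ch for ch in text if not ch.isspace() and ch not in known})


if __name__ == "__main__":
    # Made-up, simplified entries for illustration; real files carry more columns.
    sample = ["3\n", "NULL 0\n", "a 3\n", "b 3\n"]
    print(missing_characters("ab 日本", load_unicharset(sample)))  # ['日', '本']
```

With real data, the lines and the text would come from files opened with
encoding="utf-8", for example the unicharset that tesstrain generated and the
training text that was rendered.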


On Tue, Jan 12, 2021, 23:47 Kamui 7 <qntmmag...@gmail.com> wrote:

>
>
> Great! The PR that you submitted fixed issue #3. All that's left is the
> encoding string problem. I wonder if it's a problem with the unicharset
> extractor?
> On Monday, January 11, 2021 at 11:30:39 AM UTC-6 shree wrote:
>
>> Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for
>> updates
>>
>> On Saturday, January 9, 2021 at 10:19:02 PM UTC+5:30 qntmm...@gmail.com
>> wrote:
>>
>>>
>>> How do I create my own custom unicharset file? The tesstrain script
>>> seems to be generating one based on the training text but I want to pass in
>>> my own unicharset file.
>>> On Friday, January 8, 2021 at 12:58:27 AM UTC-6 shree wrote:
>>>
>>>> Are any of these vertical fonts?
>>>>
>>>> Encoding errors can occur if characters in the training text are not in
>>>> the unicharset.
>>>>
>>>> On Fri, Jan 8, 2021, 00:46 Kamui 7 <qntmm...@gmail.com> wrote:
>>>>
>>>>> Looks like that fixed bug #1. Now it is able to successfully create
>>>>> 400 pages. Do you have any ideas as to why the other 2 errors are 
>>>>> occurring?
>>>>> On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote:
>>>>>
>>>>>> Your training text file is only 175 lines, so the rendered image fits
>>>>>> in 4 pages. You need to use a larger text if you want more pages.
>>>>>>
>>>>>> Also check that your fonts support both English and Japanese as the
>>>>>> text seems to have samples of both languages.
>>>>>>
>>>>>> On Thu, Jan 7, 2021, 22:40 Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>>
>>>>>>> I did a find command in the root directory and searched for the
>>>>>>> tesstrain script. It could only find the script that I pulled from the
>>>>>>> latest tesseract git repo. My training script calls that specific
>>>>>>> tesstrain script using a relative path, so it couldn't be an older
>>>>>>> version.
>>>>>>>
>>>>>>> On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote:
>>>>>>>
>>>>>>>> Old versions of tesstrain.sh used to limit training to 3 pages.
>>>>>>>> Looks like you may have an old version in the path somewhere.
>>>>>>>>
>>>>>>>> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have a script to train tesseract and I ran it on Arch Linux,
>>>>>>>>> Debian, and even a docker container and they all produce the same
>>>>>>>>> errors. I checked to make sure the script is correct as well.
>>>>>>>>>
>>>>>>>>> Bug 1:
>>>>>>>>> This happens when tesstrain runs text2image. The max pages
>>>>>>>>> parameter does not work at all. It ends up only rendering 4 pages
>>>>>>>>> regardless of what I pass in for the maxpages parameter. I even tried
>>>>>>>>> hardcoding it into the tesstrain_utils.sh file and it still does the
>>>>>>>>> same thing.
>>>>>>>>>
>>>>>>>>> Bug 2:
>>>>>>>>> After it finishes producing those 4 pages, I fine-tune it with
>>>>>>>>> lstmtraining and the resulting output is full of "Encoding of string
>>>>>>>>> failed!" errors.
>>>>>>>>>
>>>>>>>>> Bug 3:
>>>>>>>>> Along with those encoding errors, it also outputs the following
>>>>>>>>> text:
>>>>>>>>>
>>>>>>>>> "Image too small to scale!! (2x48 vs min width of 3)
>>>>>>>>> Line cannot be recognized!!
>>>>>>>>> Image not trainable"
>>>>>>>>>
>>>>>>>>> I will upload my script along with the Dockerfile if anyone wants
>>>>>>>>> to take a look.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://drive.google.com/file/d/1FkW1q1cXwOxY6Yi1A1cMzInbtJa9L01M/view?usp=sharing
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>
