How do I create my own custom unicharset file? The tesstrain script seems to generate one from the training text, but I want to pass in my own unicharset file instead.

On Friday, January 8, 2021 at 12:58:27 AM UTC-6 shree wrote:
> Are any of these vertical fonts?
>
> Encoding errors could occur if the characters in the training text are not
> in the unicharset.
>
> On Fri, Jan 8, 2021, 00:46 Kamui 7 <qntmm...@gmail.com> wrote:
>
>> Looks like that fixed bug #1. Now it is able to successfully create 400
>> pages. Do you have any ideas as to why the other 2 errors are occurring?
>>
>> On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote:
>>
>>> Your training text file is only 175 lines, so the rendered image fits in
>>> 4 pages. You need to use a larger text if you want more pages.
>>>
>>> Also check that your fonts support both English and Japanese as the text
>>> seems to have samples of both languages.
>>>
>>> On Thu, Jan 7, 2021, 22:40 Kamui 7 <qntmm...@gmail.com> wrote:
>>>
>>>> I ran a find command in the root directory and searched for the
>>>> tesstrain script. It could only find the script that I pulled from the
>>>> latest tesseract git repo. My training script calls that specific
>>>> tesstrain script using a relative path, so it couldn't be an older
>>>> version.
>>>>
>>>> On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote:
>>>>
>>>>> Old versions of tesstrain.sh used to limit training to 3 pages. Looks
>>>>> like you may have an old version in the path somewhere.
>>>>>
>>>>> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>
>>>>>> I have a script to train tesseract and I ran it on Arch Linux,
>>>>>> Debian, and even a Docker container, and they all produce the same
>>>>>> errors. I checked to make sure the script is correct as well.
>>>>>>
>>>>>> Bug 1:
>>>>>> This happens when tesstrain runs text2image. The max pages parameter
>>>>>> does not work at all. It ends up only rendering 4 pages regardless of
>>>>>> what I pass in for the maxpages parameter. I even tried hardcoding it
>>>>>> into the tesstrain_utils.sh file and it still does the same thing.
>>>>>>
>>>>>> Bug 2:
>>>>>> After it finishes producing those 4 pages, I finetune it with
>>>>>> lstmtraining and the resulting output is full of "Encoding of string
>>>>>> failed!" errors.
>>>>>>
>>>>>> Bug 3:
>>>>>> Along with those encoding errors, it also outputs the following text:
>>>>>>
>>>>>> "Image too small to scale!! (2x48 vs min width of 3)
>>>>>> Line cannot be recognized!!
>>>>>> Image not trainable"
>>>>>>
>>>>>> I will upload my script along with the Dockerfile if anyone wants to
>>>>>> take a look.
>>>>>>
>>>>>> https://drive.google.com/file/d/1FkW1q1cXwOxY6Yi1A1cMzInbtJa9L01M/view?usp=sharing
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com
>>>>>
>>>>> --
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/31ff2130-1274-4e8c-8fc3-8c103a2fa5b7n%40googlegroups.com.
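
On shree's point that "Encoding of string failed!" can appear when the training text contains characters that are not in the unicharset, a quick way to check might look like the sketch below. It is not part of tesstrain; the function names `load_unichars` and `missing_chars` are made up for illustration. It assumes the usual plain-text unicharset layout (first line is the entry count, each following line starts with the symbol and then its properties, and the space character is stored as `NULL`). It only checks single code points, so multi-character entries in a unicharset are not matched against the text.

```python
from collections import Counter
from typing import Dict, Set


def load_unichars(unicharset_text: str) -> Set[str]:
    """Collect the symbol column from the text of a unicharset file.

    Assumed layout: the first line is the entry count, and every following
    line starts with the symbol, then a space, then its properties.
    """
    symbols = set()
    for line in unicharset_text.splitlines()[1:]:  # skip the count line
        if not line.strip():
            continue
        symbol = line.split(" ", 1)[0]
        # Assumption: the space character is stored as the token NULL.
        symbols.add(" " if symbol == "NULL" else symbol)
    return symbols


def missing_chars(training_text: str, symbols: Set[str]) -> Dict[str, int]:
    """Count non-whitespace characters in the text that have no entry."""
    counts = Counter(
        ch for ch in training_text if not ch.isspace() and ch not in symbols
    )
    return dict(counts)
```

To try it on real files, read both with `pathlib.Path(...).read_text(encoding="utf-8")` and pass the strings in. Any characters reported by `missing_chars` would be the first candidates to look at for the encoding errors, for example Japanese characters that the starting model's unicharset does not cover.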