Are any of these vertical fonts? The encoding errors can occur when characters in the training text are not in the unicharset.
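A quick way to check the unicharset theory is to list the characters in the training text that have no entry in the unicharset. The sketch below is only a rough check, not what lstmtraining does internally: it assumes the first line of the unicharset file is the entry count and that every later line begins with the unichar followed by a space, and it compares single code points only, so multi-character unichars and any text normalization are ignored. The function names and the file names in the usage comment are hypothetical.

```python
from pathlib import Path


def load_unichars(unicharset_path):
    """Return the set of unichar strings listed in a unicharset file."""
    lines = Path(unicharset_path).read_text(encoding="utf-8").splitlines()
    unichars = set()
    # Assumption: line 1 is the entry count; later lines start with the unichar.
    for line in lines[1:]:
        fields = line.split(" ")
        if fields and fields[0]:
            unichars.add(fields[0])
    return unichars


def missing_characters(text, unichars):
    """Return the sorted non-whitespace characters of text with no unicharset entry."""
    seen = {ch for ch in text if not ch.isspace()}
    return sorted(seen - unichars)


# Example usage (hypothetical file names):
#   unichars = load_unichars("jpn.unicharset")
#   text = Path("training_text.txt").read_text(encoding="utf-8")
#   print(missing_characters(text, unichars))
```

If this prints a long list, the "Encoding of string failed!" messages are most likely coming from those characters.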

On Fri, Jan 8, 2021, 00:46 Kamui 7 <qntmmag...@gmail.com> wrote:

> Looks like that fixed bug #1. Now it is able to successfully create 400
> pages. Do you have any ideas as to why the other 2 errors are occurring?
>
> On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote:
>
>> Your training text file is only 175 lines, so the rendered image fits in
>> 4 pages. You need to use a larger text if you want more pages.
>>
>> Also check that your fonts support both English and Japanese as the text
>> seems to have samples of both languages.
>>
>> On Thu, Jan 7, 2021, 22:40 Kamui 7 <qntmm...@gmail.com> wrote:
>>
>>> I did a find command in the root directory and searched for the
>>> tesstrain script. It could only find the script that I pulled from the
>>> latest tesseract git repo. My training script calls that specific
>>> tesstrain script using a relative path, so it couldn't be an older
>>> version.
>>>
>>> On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote:
>>>
>>>> Old versions of tesstrain.sh used to limit training to 3 pages. Looks
>>>> like you may have an old version in the path somewhere.
>>>>
>>>> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 <qntmm...@gmail.com> wrote:
>>>>
>>>>> I have a script to train tesseract and I ran it on Arch Linux, Debian,
>>>>> and even a Docker container, and they all produce the same errors. I
>>>>> checked to make sure the script is correct as well.
>>>>>
>>>>> Bug 1:
>>>>> This happens when tesstrain runs text2image. The max pages parameter
>>>>> does not work at all. It ends up only rendering 4 pages regardless of
>>>>> what I pass in for the maxpages parameter. I even tried hardcoding it
>>>>> into the tesstrain_utils.sh file and it still does the same thing.
>>>>>
>>>>> Bug 2:
>>>>> After it finishes producing those 4 pages, I finetune it with
>>>>> lstmtraining and the resulting output is full of "Encoding of string
>>>>> failed!" errors.
>>>>>
>>>>> Bug 3:
>>>>> Along with those encoding errors, it also outputs the following text:
>>>>>
>>>>> "Image too small to scale!! (2x48 vs min width of 3)
>>>>> Line cannot be recognized!!
>>>>> Image not trainable"
>>>>>
>>>>> I will upload my script along with the Dockerfile if anyone wants to
>>>>> take a look.
>>>>>
>>>>> https://drive.google.com/file/d/1FkW1q1cXwOxY6Yi1A1cMzInbtJa9L01M/view?usp=sharing
>>>>
>>>> --
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWt8uD6B35z21cXXiMMwra3jPHyu2KZ4euZ7XmxUe62WA%40mail.gmail.com.