The unicharset is extracted from the training text, because those are the samples that will be used for training.
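To get a rough idea of which characters a training text would contribute, a small sketch like the one below can help. It is only an approximation: it counts Unicode code points in plain Python after NFC normalization, while the real extractor applies Tesseract's own normalization and may group code points differently. The function name `char_inventory` and the sample string are made up for illustration.

```python
from collections import Counter
import unicodedata


def char_inventory(text: str) -> Counter:
    """Count every non-whitespace code point in the text after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in normalized if not ch.isspace())


if __name__ == "__main__":
    # Hypothetical mixed English/Japanese sample, not taken from the thread.
    sample = "Tesseract テスト\ntest"
    for ch, count in sorted(char_inventory(sample).items()):
        print(f"U+{ord(ch):04X} {ch} {count}")
```

Running this over the actual training text file gives a quick view of rare or unexpected characters before any training is started.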
Why do you want to use a different unicharset?

On Tue, Jan 12, 2021, 23:47 Kamui 7 <qntmmag...@gmail.com> wrote:

> Great! The PR that you submitted fixed issue #3. All that's left is the
> encoding string problem. I wonder if it's a problem with the unicharset
> extractor?
>
> On Monday, January 11, 2021 at 11:30:39 AM UTC-6 shree wrote:
>
>> Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for
>> updates
>>
>> On Saturday, January 9, 2021 at 10:19:02 PM UTC+5:30 qntmm...@gmail.com
>> wrote:
>>
>>> How do I create my own custom unicharset file? The tesstrain script
>>> seems to be generating one based on the training text but I want to
>>> pass in my own unicharset file.
>>>
>>> On Friday, January 8, 2021 at 12:58:27 AM UTC-6 shree wrote:
>>>
>>>> Are any of these vertical fonts?
>>>>
>>>> Encoding errors can occur if the characters in the training text are
>>>> not in the unicharset.
>>>>
>>>> On Fri, Jan 8, 2021, 00:46 Kamui 7 <qntmm...@gmail.com> wrote:
>>>>
>>>>> Looks like that fixed bug #1. Now it is able to successfully create
>>>>> 400 pages. Do you have any ideas as to why the other 2 errors are
>>>>> occurring?
>>>>>
>>>>> On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote:
>>>>>
>>>>>> Your training text file is only 175 lines, so the rendered image
>>>>>> fits in 4 pages. You need to use a larger text if you want more
>>>>>> pages.
>>>>>>
>>>>>> Also check that your fonts support both English and Japanese as the
>>>>>> text seems to have samples of both languages.
>>>>>>
>>>>>> On Thu, Jan 7, 2021, 22:40 Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>>
>>>>>>> I did a find command in the root directory and searched for the
>>>>>>> tesstrain script. It could only find the script that I pulled from
>>>>>>> the latest tesseract git repo. My training script calls that
>>>>>>> specific tesstrain script using a relative path, so it couldn't be
>>>>>>> an older version.
>>>>>>>
>>>>>>> On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote:
>>>>>>>
>>>>>>>> Old versions of tesstrain.sh used to limit training to 3 pages.
>>>>>>>> Looks like you may have an old version in the path somewhere.
>>>>>>>>
>>>>>>>> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 <qntmm...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have a script to train tesseract and I ran it on Arch Linux,
>>>>>>>>> Debian, and even a Docker container, and they all produce the
>>>>>>>>> same errors. I checked to make sure the script is correct as well.
>>>>>>>>>
>>>>>>>>> Bug 1:
>>>>>>>>> This happens when tesstrain runs text2image. The max pages
>>>>>>>>> parameter does not work at all. It ends up only rendering 4 pages
>>>>>>>>> regardless of what I pass in for the maxpages parameter. I even
>>>>>>>>> tried hardcoding it into the tesstrain_utils.sh file and it still
>>>>>>>>> does the same thing.
>>>>>>>>>
>>>>>>>>> Bug 2:
>>>>>>>>> After it finishes producing those 4 pages, I finetune it with
>>>>>>>>> lstmtraining and the resulting output is full of "Encoding of
>>>>>>>>> string failed!" errors.
>>>>>>>>>
>>>>>>>>> Bug 3:
>>>>>>>>> Along with those encoding errors, it also outputs the following
>>>>>>>>> text:
>>>>>>>>>
>>>>>>>>> "Image too small to scale!! (2x48 vs min width of 3)
>>>>>>>>> Line cannot be recognized!!
>>>>>>>>> Image not trainable"
>>>>>>>>>
>>>>>>>>> I will upload my script along with the Dockerfile if anyone wants
>>>>>>>>> to take a look.
>>>>>>>>>
>>>>>>>>> https://drive.google.com/file/d/1FkW1q1cXwOxY6Yi1A1cMzInbtJa9L01M/view?usp=sharing
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
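One way to test the suggestion above, that "Encoding of string failed!" can come from characters in the training text that are missing from the unicharset, is to compare the two directly. The sketch below is not part of Tesseract. It assumes the usual unicharset layout (first line is the entry count, every later line starts with the unichar followed by a space and its properties, and the entry `NULL` stands for the space character). The function names are hypothetical. Because it compares single code points, it can report false positives for scripts whose unicharset entries span several code points.

```python
from pathlib import Path
from typing import Set


def load_unichars(unicharset_path: str) -> Set[str]:
    """Read the first field of each entry line in a unicharset file.

    Assumed layout: line 1 holds the entry count, and every later line
    starts with the unichar followed by a space and its properties.
    The entry "NULL" is assumed to stand for the space character.
    """
    lines = Path(unicharset_path).read_text(encoding="utf-8").splitlines()
    unichars = set()
    for line in lines[1:]:  # skip the entry-count line
        if not line.strip():
            continue
        first = line.split(" ", 1)[0]
        unichars.add(" " if first == "NULL" else first)
    return unichars


def missing_chars(training_text: str, unichars: Set[str]) -> Set[str]:
    """Return non-whitespace code points that have no single-character entry."""
    return {
        ch for ch in training_text
        if not ch.isspace() and ch not in unichars
    }
```

A typical use would be to read the training text with `Path(...).read_text(encoding="utf-8")`, pass it to `missing_chars` together with the result of `load_unichars`, and print the sorted result. An empty set suggests the encoding errors have a different cause.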