According to Ray Smith, lead developer of Tesseract, if you only need to
fine-tune for a new font face, use the "Fine Tuning for Impact" approach
from the training tutorial.

His example renders the training text from the langdata repo (approx. 80
lines) with the new font, generates lstmf files from it, and then runs
lstmtraining on those for about 400 iterations.
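
A rough sketch of those steps, following the commands in the tutorial's
fine-tuning section. The font name and every path below are placeholders
(assumptions), so adjust them to your setup. The script only prints the
commands unless you set RUN=1.

```shell
#!/bin/sh
# Sketch only: font name and all paths are assumptions, not a tested recipe.
# Commands are printed by default; set RUN=1 to actually execute them.
# (The printed form drops quoting, so use it for inspection, not copy-paste.)
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

FONT="Impact Condensed"               # assumed font name, use your own
LANGDATA=../langdata                  # assumed checkout of the langdata repo
BEST=./tessdata/best/eng.traineddata  # assumed location of the "best" model
TRAIN_DIR=./tesstutorial/engtrain     # hypothetical output dir for lstmf files
OUT=./tesstutorial/impact_from_full   # hypothetical output dir for the model

# 1. Render the training text with the font and write lstmf files.
run tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir "$LANGDATA" \
  --tessdata_dir ./tessdata --fontlist "$FONT" --output_dir "$TRAIN_DIR"

# 2. Extract the LSTM model from the existing traineddata.
run mkdir -p "$OUT"
run combine_tessdata -e "$BEST" "$OUT/eng.lstm"

# 3. Fine-tune from that model for about 400 iterations.
run lstmtraining --model_output "$OUT/impact" \
  --continue_from "$OUT/eng.lstm" \
  --traineddata "$BEST" \
  --train_listfile "$TRAIN_DIR/eng.training_files.txt" \
  --max_iterations 400

# 4. Convert the final checkpoint into a usable traineddata file.
run lstmtraining --stop_training \
  --continue_from "$OUT/impact_checkpoint" \
  --traineddata "$BEST" \
  --model_output "$OUT/eng_impact.traineddata"
```

Running it without RUN=1 first lets you check the paths before anything
is written.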

Using too few lines or too many iterations will lead to suboptimal results.

You can whitelist only digits to further improve your results.
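
For example, a digits-only run could look roughly like this. The image
and output names are hypothetical, and I am assuming a 4.1.x build, where
tessedit_char_whitelist is honored by the LSTM engine. As written it only
prints the command; set RUN=1 to execute it.

```shell
#!/bin/sh
# Sketch only: receipt.tif and receipt_digits are hypothetical names.
# The command is printed by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

IMG=receipt.tif           # hypothetical input image
OUTBASE=receipt_digits    # hypothetical output base name (writes .txt)
WHITELIST=0123456789      # restrict recognition to digits only

run tesseract "$IMG" "$OUTBASE" --dpi 600 \
  -c tessedit_char_whitelist="$WHITELIST"
```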

The above applies to LSTM (neural network based) training. That is the
only engine that allows fine-tuning.

Your second approach is for the legacy engine. That does not have any
option for fine-tuning.

See the shreeshrii/tess4training repo for my replication of the
tesstutorials by Ray.

On Fri, Apr 3, 2020, 16:40 hmaster <[email protected]> wrote:

> Hello,
>
> I am trying to improve accuracy for my use case, by fine tuning. Currently
> I'm getting between 80-90% accuracy on my scanned images, and around 60%
> for images taken via phone.
> I'm running on a Jetson Nano, using:
> ```
> tesseract 4.1.1-rc2-21-gf4ef
>  leptonica-1.78.0
>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 :
> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>  Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
> ```
>
> I'm training on a single image, just to understand the mechanism, and
> learn about it.
> I'm using a scanned receipt as an example, 600 dpi. ImageMagick's identify
> says it's 1696x3930.
>
> I'm confused a bit by this, as the script still runs, and the error rate
> keeps dropping.
> I've read the tutorials and examples, and the scripts, and it's all too
> much for now, as I've been at it for about 2-3 weeks now.
>
> There are a couple of things that are still unclear to me, and have some
> questions:
>
> 1. Do I need to create single line images from each image I have? (~3000)
> 2. would it help if I create ground-truth text files - for the entire
> image, or should I create only for a single line? (that is I must have
> tiff, box and ground-truth files for each image)
> 3. some of the words in my images are not found in the
> eng.training_files.txt, as such would it speed up/help if I add them?
> 4. is there a way to do fine tuning with my own images and my own
> eng.training_files.txt data, without running tesstrain.sh?
>
> I could not find details about how to train/fine tune with own tif/box.
> Meaning, I have created a folder with my data, and passed it to
> tesstrain.sh via my_box_tiff_dir, but it's not using those, from what I can
> tell, as it creates synth data.
> As said above, it's unclear to me if I need to generate the ground-truth
> data as well, do I still need to fiddle/fix the box files, etc.
>
> Sorry if I asked too many questions, I've invested so much time in it, and
> I'm not sure where exactly I'm doing wrong.
>
> I've followed the steps in few of the questions posted in this group, and
> I am getting decent results, however, they are not as good as using the
> traineddata_best on its own.
>
> Steps I've done were:
>
> *Method 1*
> 1. create box files via lstmbox and fix any mistakes - tesseract img.tif
> img --dpi 600 lstmbox
> 2. extract lstm from eng.traineddata_best
> 3. run lstmtraining for fine tuning - lstmtraining --continue_from...
> 4. generate eng.traineddata - lstmtraining stop...
>
> *Method 2*
> 1. create box files via lstmbox and fix any mistakes - tesseract img.tif
> img --dpi 600 lstmbox
> 2. create lstmf files - tesseract img.tif img --dpi 600 lstm.train
> 3. extract unicharset - unicharset_extractor *.box
> 4. shapeclustering -F font_properties -U unicharset *.tr
> 5. mftraining -F font_properties -U unicharset -O eng.unicharset *.tr
> 6. cntraining *.tr
> 7. rename inttemp, normproto, pffmtable, shapetable
> 8. combine_tessdata eng.
>
> Thank you for your support and help with my endeavor.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a2a43c7e-c658-4d22-af1c-32dbd1d5b2f4%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/a2a43c7e-c658-4d22-af1c-32dbd1d5b2f4%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

