Hello,

I am trying to improve accuracy for my use case by fine-tuning. Currently 
I'm getting 80-90% accuracy on my scanned images, and around 60% on 
images taken with a phone.
I'm running on a Jetson Nano, using:
```
tesseract 4.1.1-rc2-21-gf4ef
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
```

I'm training on a single image for now, just to understand the mechanism 
and learn how it works.
As an example I'm using a scanned receipt at 600dpi; ImageMagick's 
identify reports it as 1696x3930.
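
For reference, this is how I checked the image (receipt.tif is a placeholder for my actual file name):
```
# ImageMagick: print the format and pixel dimensions of the scan
identify receipt.tif
# receipt.tif TIFF 1696x3930 1696x3930+0+0 8-bit ...
```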

I'm a bit confused by this, as the script just keeps running and the 
error rate keeps dropping.
I've read the tutorials, the examples, and the scripts, but it's all a 
bit much for now; I've been at it for about 2-3 weeks.

There are a couple of things that are still unclear to me, so I have some 
questions:

1. Do I need to create single-line images from each image I have (~3000)?
2. Would it help if I created ground-truth text files? Should each one 
cover the entire image, or only a single line? (That is, do I need tiff, 
box, and ground-truth files for each image?)
3. Some of the words in my images are not found in 
eng.training_files.txt; would it help or speed things up if I added them?
4. Is there a way to fine-tune with my own images and my own 
eng.training_files.txt data, without running tesstrain.sh? (See the 
sketch after this list for what I imagine this would look like.)
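
To make question 4 concrete, this is roughly what I imagine doing by hand instead of tesstrain.sh (the mydata/ folder and file names are hypothetical; my understanding is that the lstm.train config writes an .lstmf file next to each image):
```
# Build .lstmf files from my own tif/box pairs, then list them for lstmtraining
for f in mydata/*.tif; do
    tesseract "$f" "${f%.tif}" --dpi 600 lstm.train   # writes ${f%.tif}.lstmf
done
ls mydata/*.lstmf > eng.training_files.txt
```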

I could not find details about how to train/fine-tune with my own 
tif/box pairs. I created a folder with my data and passed it to 
tesstrain.sh via my_box_tiff_dir, but as far as I can tell it isn't 
using those files, since it still creates synthetic data.
As mentioned above, it's also unclear to me whether I need to generate 
the ground-truth data as well, whether I still need to fiddle with/fix 
the box files, and so on.
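
For reference, my invocation looks roughly like this (directory paths are placeholders, and the flag spelling is taken from my local copy of tesstrain.sh):
```
# Point tesstrain.sh at my own box/tiff pairs instead of synthetic text
src/training/tesstrain.sh \
    --lang eng \
    --langdata_dir ~/langdata \
    --tessdata_dir ~/tessdata_best \
    --my_boxtiff_dir ~/mydata \
    --linedata_only \
    --output_dir ~/train_output
```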

Sorry if I'm asking too many questions; I've invested a lot of time in 
this and I'm not sure where exactly I'm going wrong.

I've followed the steps from a few of the questions posted in this group, 
and I am getting decent results; however, they are not as good as using 
the traineddata_best model on its own.

The steps I followed were:

*Method 1*
1. Create box files with the lstmbox config and fix any mistakes: 
tesseract img.tif img --dpi 600 lstmbox
2. Extract the LSTM model from the best eng.traineddata
3. Run lstmtraining for fine-tuning: lstmtraining --continue_from ...
4. Generate eng.traineddata: lstmtraining --stop_training ... (the full 
commands are sketched below)
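
To be explicit, the commands behind steps 2-4 looked roughly like this (paths, output names, and the iteration count are just from my setup):
```
# Step 2: extract the LSTM model from the best traineddata
combine_tessdata -e tessdata/eng.traineddata eng.lstm

# Step 3: fine-tune from the extracted model
lstmtraining \
    --model_output output/fine_tuned \
    --continue_from eng.lstm \
    --traineddata tessdata/eng.traineddata \
    --train_listfile eng.training_files.txt \
    --max_iterations 400

# Step 4: convert the checkpoint into a usable traineddata
lstmtraining --stop_training \
    --continue_from output/fine_tuned_checkpoint \
    --traineddata tessdata/eng.traineddata \
    --model_output output/eng.traineddata
```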

*Method 2*
1. Create box files with the lstmbox config and fix any mistakes: 
tesseract img.tif img --dpi 600 lstmbox
2. Create lstmf files: tesseract img.tif img --dpi 600 lstm.train
3. Extract the unicharset: unicharset_extractor *.box
4. Run shapeclustering -F font_properties -U unicharset *.tr
5. Run mftraining -F font_properties -U unicharset -O eng.unicharset *.tr
6. Run cntraining *.tr
7. Rename inttemp, normproto, pffmtable, and shapetable (see below)
8. Run combine_tessdata eng.
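
Concretely, steps 7-8 were (the clustering tools write generic file names, which have to be prefixed with the language code before combining):
```
# Step 7: prefix the clustering outputs with the language code
mv inttemp eng.inttemp
mv normproto eng.normproto
mv pffmtable eng.pffmtable
mv shapetable eng.shapetable

# Step 8: bundle everything into eng.traineddata
combine_tessdata eng.
```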

Thank you for your support and help with my endeavor.
