Hello,

I am trying to improve accuracy for my use case by fine-tuning. Currently 
I'm getting 80-90% accuracy on my scanned images, and around 60% on 
images taken with a phone.
I'm running on a Jetson Nano, using:
```
tesseract 4.1.1-rc2-21-gf4ef
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
```

I'm training on a single image for now, just to understand the mechanism 
and learn how it works.
As an example I'm using a scanned receipt at 600dpi; ImageMagick's 
identify reports it as 1696x3930.
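
For reference, this is how I checked the image (receipt.tif is a placeholder for my actual file name):
```
# ImageMagick: print the format and pixel dimensions of the scan
identify receipt.tif
# receipt.tif TIFF 1696x3930 1696x3930+0+0 8-bit ...
```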

I'm a bit confused by this, as the script just keeps running and the 
error rate keeps dropping.
I've read the tutorials, the examples, and the scripts, but it's all a 
bit much for now; I've been at it for about 2-3 weeks.

There are a couple of things that are still unclear to me, so I have some 
questions:

1. Do I need to create single-line images from each image I have (~3000)?
2. Would it help if I created ground-truth text files? Should each one 
cover the entire image, or only a single line? (That is, do I need tiff, 
box, and ground-truth files for each image?)
3. Some of the words in my images are not found in 
eng.training_files.txt; would it help or speed things up if I added them?
4. Is there a way to fine-tune with my own images and my own 
eng.training_files.txt data, without running tesstrain.sh? (See the 
sketch after this list for what I imagine this would look like.)
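
To make question 4 concrete, this is roughly what I imagine doing by hand instead of tesstrain.sh (the mydata/ folder and file names are hypothetical; my understanding is that the lstm.train config writes an .lstmf file next to each image):
```
# Build .lstmf files from my own tif/box pairs, then list them for lstmtraining
for f in mydata/*.tif; do
    tesseract "$f" "${f%.tif}" --dpi 600 lstm.train   # writes ${f%.tif}.lstmf
done
ls mydata/*.lstmf > eng.training_files.txt
```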

I could not find details about how to train/fine-tune with my own 
tif/box pairs. I created a folder with my data and passed it to 
tesstrain.sh via my_box_tiff_dir, but as far as I can tell it isn't 
using those files, since it still creates synthetic data.
As mentioned above, it's also unclear to me whether I need to generate 
the ground-truth data as well, whether I still need to fiddle with/fix 
the box files, and so on.
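
For reference, my invocation looks roughly like this (directory paths are placeholders, and the flag spelling is taken from my local copy of tesstrain.sh):
```
# Point tesstrain.sh at my own box/tiff pairs instead of synthetic text
src/training/tesstrain.sh \
    --lang eng \
    --langdata_dir ~/langdata \
    --tessdata_dir ~/tessdata_best \
    --my_boxtiff_dir ~/mydata \
    --linedata_only \
    --output_dir ~/train_output
```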

Sorry if I'm asking too many questions; I've invested a lot of time in 
this and I'm not sure where exactly I'm going wrong.

I've followed the steps from a few of the questions posted in this group, 
and I am getting decent results; however, they are not as good as using 
the traineddata_best model on its own.

The steps I followed were:

*Method 1*
1. Create box files with the lstmbox config and fix any mistakes: 
tesseract img.tif img --dpi 600 lstmbox
2. Extract the LSTM model from the best eng.traineddata
3. Run lstmtraining for fine-tuning: lstmtraining --continue_from ...
4. Generate eng.traineddata: lstmtraining --stop_training ... (the full 
commands are sketched below)
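
To be explicit, the commands behind steps 2-4 looked roughly like this (paths, output names, and the iteration count are just from my setup):
```
# Step 2: extract the LSTM model from the best traineddata
combine_tessdata -e tessdata/eng.traineddata eng.lstm

# Step 3: fine-tune from the extracted model
lstmtraining \
    --model_output output/fine_tuned \
    --continue_from eng.lstm \
    --traineddata tessdata/eng.traineddata \
    --train_listfile eng.training_files.txt \
    --max_iterations 400

# Step 4: convert the checkpoint into a usable traineddata
lstmtraining --stop_training \
    --continue_from output/fine_tuned_checkpoint \
    --traineddata tessdata/eng.traineddata \
    --model_output output/eng.traineddata
```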

*Method 2*
1. Create box files with the lstmbox config and fix any mistakes: 
tesseract img.tif img --dpi 600 lstmbox
2. Create lstmf files: tesseract img.tif img --dpi 600 lstm.train
3. Extract the unicharset: unicharset_extractor *.box
4. Run shapeclustering -F font_properties -U unicharset *.tr
5. Run mftraining -F font_properties -U unicharset -O eng.unicharset *.tr
6. Run cntraining *.tr
7. Rename inttemp, normproto, pffmtable, and shapetable (see below)
8. Run combine_tessdata eng.
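
Concretely, steps 7-8 were (the clustering tools write generic file names, which have to be prefixed with the language code before combining):
```
# Step 7: prefix the clustering outputs with the language code
mv inttemp eng.inttemp
mv normproto eng.normproto
mv pffmtable eng.pffmtable
mv shapetable eng.shapetable

# Step 8: bundle everything into eng.traineddata
combine_tessdata eng.
```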

Thank you for your support and help with my endeavor.
