Hi,

I want to train tesseract with tesstrain, using .tif and .gt.txt pairs.
However, the native images are 231 DPI scans of old books from the 1800s,
which I assume is pretty low based on what I have read on various forums.
The pages are also very dense: about 90% of both sides of a page is just
text, and pictures are really rare. I tried a lot of methods to improve
the image quality, including ImageMagick scripts and some projects from
GitHub, with little to no improvement. ImageMagick's resample seems to
have the most impact. I tried 300, 400, 600, 800 and 1000 DPI, and the
"sweet spot" was 800 based on the results: there is a regression at 1000,
and below 800 I saw errors along the lines of "line could not be read".
I used tesseract's hOCR output, then hocr-tools to generate the segmented
line image and text pairs.
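
To make the resampling step concrete, here is a small Python sketch of the
arithmetic behind those DPI targets. It assumes the scans carry a correct
231 DPI density tag, so that resampling to a target DPI scales the pixel
dimensions by target / native. The helper names scale_factor and
resampled_size are hypothetical and not part of ImageMagick or tesseract.

```python
# Sketch only: the ratio by which a page grows when it is resampled
# from its native resolution to a target resolution.
# Assumption: the image's density metadata really is 231 DPI.
NATIVE_DPI = 231
TARGET_DPIS = [300, 400, 600, 800, 1000]


def scale_factor(native_dpi, target_dpi):
    """Linear scaling ratio between two resolutions."""
    return target_dpi / native_dpi


def resampled_size(width_px, height_px, native_dpi, target_dpi):
    """Approximate pixel dimensions after resampling (rounded)."""
    factor = scale_factor(native_dpi, target_dpi)
    return round(width_px * factor), round(height_px * factor)


if __name__ == "__main__":
    for dpi in TARGET_DPIS:
        print(f"{dpi} DPI -> x{scale_factor(NATIVE_DPI, dpi):.2f}")
```

So 800 DPI means every line of text is roughly 3.5 times taller in pixels
than in the original scan, and the same factor would apply to any image the
trained model is later run on if it is resampled the same way.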
So here's my dilemma:
 - What DPI range is best for tesseract? Is 800 too much?
 - If I upscale the initial image, generate the hOCR from it and then
   segment it, should I also upscale all the images that I will later use
   my trained model on?
 - Does the ground truth need to be in some order? For example, the ground
   truth for one segmented file is numbered from 00001 to 00999, and for
   another it has one zero less, from 0001 to 0999. If I just put them all
   into the same folder, is that okay?
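
For the last point, this is a rough Python sketch of the kind of check I
have in mind before training. It assumes tesstrain matches each line image
to its transcription by shared base name (foo.tif with foo.gt.txt), so the
amount of zero padding should not matter as long as both files of a pair
use the same name. The function name unpaired_files is hypothetical.

```python
from pathlib import Path


def unpaired_files(folder):
    """Return base names that have a .tif without a .gt.txt, or the reverse."""
    folder = Path(folder)
    # Strip the full suffix so "00001.gt.txt" and "00001.tif" both give "00001".
    images = {p.name[: -len(".tif")] for p in folder.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in folder.glob("*.gt.txt")}
    # Symmetric difference: names present on one side only.
    return sorted(images ^ texts)
```

An empty list from unpaired_files("path/to/ground-truth") would mean every
image has a transcription and vice versa, whatever the numbering looks like.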

Hopefully that makes sense and my English is not that bad. Apologies if I 
sound confusing, it's kinda hard to explain. I'll add any additional info 
if I missed it.
Feel free to ask me anything and thanks in advance.
