Hi,

I want to train Tesseract with tesstrain, using .tif and .gt.txt pairs. The source images are 231 DPI scans of books from the 1800s, which I assume is pretty low based on what I've read on various forums. The pages are also very dense: roughly 90% of each page is text, and pictures are rare.

I tried many ways to improve image quality, including ImageMagick scripts and some projects from GitHub, with little to no improvement. ImageMagick's resample seems to have the most impact. I tried 300, 400, 600, 800 and 1000 DPI, and 800 looks like the sweet spot based on the results: there is a regression at 1000, and below 800 I saw errors along the lines of "line could not be read". I used Tesseract's hOCR output, then hocr-tools, to generate the segmented pairs.

So here's my dilemma:

- What DPI range is best for Tesseract? Is 800 too much?
- If I upscale the initial image that I generate the hOCR from and then segment, should I also upscale all the images that I will later run my trained model on?
- Does the ground truth need to be in some order? When I create ground truth for one segmented file, the names go, for example, from 00001 to 00999, and for another file they have one zero less, like 0001 to 0999. Is it okay if I just put them all into the same folder?
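In case it helps with the last question, this is roughly how I would check the merged folder before training. It is only a minimal sketch: `check_pairs` is a hypothetical helper name of mine, and it assumes that an image and its transcription are matched purely by base name (e.g. `00001.tif` with `00001.gt.txt`), which is my understanding of tesstrain but not something I have confirmed. It does not answer whether order matters; it only reports files that have no partner.

```python
from pathlib import Path


def check_pairs(folder):
    """Compare .tif and .gt.txt files in `folder` by base name.

    Returns three sorted lists of base names:
    (paired, images_without_gt, gt_without_image).
    """
    folder = Path(folder)
    # Base name = file name with the known suffix cut off,
    # e.g. "00001" for both "00001.tif" and "00001.gt.txt".
    images = {p.name[: -len(".tif")] for p in folder.glob("*.tif")}
    truths = {p.name[: -len(".gt.txt")] for p in folder.glob("*.gt.txt")}

    paired = sorted(images & truths)
    images_without_gt = sorted(images - truths)
    gt_without_image = sorted(truths - images)
    return paired, images_without_gt, gt_without_image
```

I would call it as `check_pairs("path/to/ground-truth-folder")` (the path is a placeholder) and look at the second and third lists. Since names like 00001 and 0001 are different strings, the two numbering schemes at least cannot overwrite each other in one folder.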
Hopefully that makes sense and my English is not that bad. Apologies if I sound confusing, it's kinda hard to explain. I'll add any additional info if I missed it. Feel free to ask me anything and thanks in advance.