Improving OCR for large batches of documents

Impix Wed, 19 Oct 2011 18:40:05 -0700

Hey all,

I use Tesseract to automatically OCR batches of TIFF files, but the
accuracy is pretty much hit or miss. I've been using ImageMagick to
convert from PDF to TIFF, and something like "convert -density 380"
produces great OCR results for one file, whereas the same setting does
not work well for another scan. How do I work out which "-density"
value suits a given file? Is there some data from the identify command
that I can use to calculate the ideal value, e.g. anything from:
http://pastie.org/2726564 ?
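
For context, here is a rough sketch of what "-density" controls when a
PDF is rasterised. It assumes page sizes are given in PDF points (1/72
inch); the function names and the 10 pt font size are made up for
illustration, and the real pixel counts from ImageMagick may differ by
a pixel or so because of rounding. For a PDF that just wraps a scanned
image, the scan's own resolution is probably the real limit, so treat
this as a way to reason about the numbers rather than a formula for
the ideal value.

```python
# Sketch: how a "-density" value (DPI) maps a PDF page to raster pixels.
# PDF page sizes are expressed in points; 1 point = 1/72 inch.

POINTS_PER_INCH = 72.0


def raster_size(page_width_pt, page_height_pt, density):
    """Approximate pixel dimensions of a page rasterised at `density` DPI."""
    scale = density / POINTS_PER_INCH
    return round(page_width_pt * scale), round(page_height_pt * scale)


def text_height_px(font_size_pt, density):
    """Approximate pixel height of a font's em box at `density` DPI."""
    return font_size_pt * density / POINTS_PER_INCH


if __name__ == "__main__":
    # US Letter is 612 x 792 points; 10 pt body text is an assumption.
    for density in (150, 300, 380):
        width, height = raster_size(612, 792, density)
        print(density, width, height, round(text_height_px(10, density), 1))
```

Comparing the pixel size reported by identify against the physical
page size gives the effective DPI of a file, which may help explain
why one density setting suits some scans and not others.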


And what else should I try doing to the image to improve the results
from Tesseract? At the moment I'm just using ImageMagick and was
thinking of playing with parameters that increase brightness and
contrast, turning the alpha layer off, etc. I'm open to any other
tools and ideas if they're gonna help...
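
The kind of preprocessing mentioned above could be scripted roughly as
follows. This is only a sketch: the helper names, the default values
and the choice to convert to grayscale are assumptions for
illustration, not recommended settings. It only builds the argument
lists for ImageMagick's convert ("-density", "-alpha off",
"-colorspace Gray", "-brightness-contrast") and for tesseract, so the
parameters can be varied per batch and the commands inspected before
anything is run.

```python
# Sketch: build the ImageMagick and Tesseract command lines for one file.
# Helper names and default values are illustrative assumptions.


def build_convert_command(pdf_path, tiff_path, density=300,
                          brightness=0, contrast=0, grayscale=True):
    """Return the argument list for a PDF -> TIFF conversion."""
    # -density has to appear before the input file so that it controls
    # the resolution at which the PDF is rasterised.
    cmd = ["convert", "-density", str(density), pdf_path, "-alpha", "off"]
    if grayscale:
        cmd += ["-colorspace", "Gray"]
    if brightness or contrast:
        # Percent adjustments, written as BRIGHTNESSxCONTRAST.
        cmd += ["-brightness-contrast", f"{brightness}x{contrast}"]
    cmd.append(tiff_path)
    return cmd


def build_tesseract_command(tiff_path, output_base):
    """Return the argument list for OCR; Tesseract writes output_base.txt."""
    return ["tesseract", tiff_path, output_base]


if __name__ == "__main__":
    print(" ".join(build_convert_command("scan.pdf", "scan.tif",
                                         density=380, contrast=20)))
    print(" ".join(build_tesseract_command("scan.tif", "scan")))
```

Each list can be passed to subprocess.run(cmd, check=True) once the
settings look reasonable for a sample of the batch.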

Thanks

