I have a PDF file containing some tabular data.
http://dl.dropbox.com/u/44235928/sample_rotate-0.pdf

I have to extract the tabular data from it. I am converting the PDF into an image with the ImageMagick convert utility and then processing the image:

    convert -rotate 90 -geometry 10000 -depth 8 -density 800 sample.pdf img_800_10000.tif

Since my PDF file consists only of letters and numbers, I have created a config file named letters to whitelist the alphanumeric characters (and avoid junk characters):

    tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,./-()$@_#

I am running tesseract as:

    tesseract img_800_10000.tif img_800_10000.tif nobatch letters

This way, approximately 80% of the extracted data is correct.

I then repeated the process with an image created manually: I opened the PDF, zoomed in, took a screenshot, cropped about 15-20 rows of the table, and processed the cropped image with tesseract. I got 100% accuracy. This suggests that something is going wrong when the image is created from the PDF with ImageMagick convert. The PDF itself seems to be of very good quality, because the fonts stay crisp even at high zoom.

Can tesseract-ocr read directly from a PDF instead of TIFF images? If not, how can I create a good-quality image from the PDF to feed to tesseract for better accuracy? (I am able to create it manually, but I would prefer to do it through a script.) Please suggest parameter values (density, geometry, depth, monochrome, etc.) for convert.

Thanks,
Piyush
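
For anyone wanting to reproduce or vary the steps above from a script, here is a minimal sketch in Python. It assumes the ImageMagick convert binary and tesseract are both on the PATH. The function names and keyword arguments are hypothetical, introduced only for illustration, and the default values simply mirror the commands quoted in this message; they are not recommended settings.

```python
import subprocess

# Whitelist copied verbatim from the "letters" config quoted in the message.
WHITELIST = (
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz,./-()$@_#"
)


def config_text():
    """Return the single line that goes into the 'letters' config file."""
    return "tessedit_char_whitelist " + WHITELIST + "\n"


def build_convert_cmd(pdf_path, tif_path, density=800, width=10000,
                      rotate=90, depth=8):
    """Build the convert argument list, in the same option order as the
    command quoted in the message (all options precede the input file)."""
    return [
        "convert",
        "-rotate", str(rotate),
        "-geometry", str(width),
        "-depth", str(depth),
        "-density", str(density),
        pdf_path,
        tif_path,
    ]


def build_tesseract_cmd(tif_path, out_base, config="letters"):
    """Build the tesseract argument list used in the message."""
    return ["tesseract", tif_path, out_base, "nobatch", config]


def run_pipeline(pdf_path, tif_path, **convert_options):
    """Rasterize the PDF, then OCR the result.

    Raises subprocess.CalledProcessError if either tool exits non-zero.
    Assumption: the 'letters' config file already exists where this
    tesseract installation looks for config files.
    """
    subprocess.run(build_convert_cmd(pdf_path, tif_path, **convert_options),
                   check=True)
    # The output base is the TIFF name itself, as in the original command.
    subprocess.run(build_tesseract_cmd(tif_path, tif_path), check=True)
```

Calling run_pipeline("sample.pdf", "img_800_10000.tif") would run both steps with the values from the message. Passing different density or width values in a loop is one way to compare settings against the screenshot-based result.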

