Hi Nikhil and all,

I had the best results with a Python tool called pdf-table-extract:
https://github.com/ashima/pdf-table-extract

You have to tweak the parameters a bit, but then it rather nicely extracts the coordinates of each cell (defined as something surrounded by a black rectangle), which you can then feed into Ghostscript or some such to extract the image (gs is faster than pdftoppm, IMHO).

In most cases

    pdf-table-extract -i FILE -p PAGE -r 300 -l 0.7 -t cells_xml

worked nicely for electoral rolls...

Just my 5 cents,
Raphael

On 29.09.2015 21:19, Nikhil VJ wrote:
> Hi Raphael,
>
> Thanks for sharing about Tesseract: it always helps to know what's in
> the engines ~:)
>
> I wish we had a way of OCR'ing tabular documents. Tabula's interface
> combined with OCR.
> I created a feature request on Tabula for this :
> https://github.com/tabulapdf/tabula/issues/409
> Let's hope it gets some love! Please +1 it!
>
> Siddharth, you should share at least a one page PDF sample of what
> you're working with, we'll be able to see which way is best for what
> you've got.
>
> If one goes the OCR way, we might need to convert the target PDF to
> image format. There are quite some online sites for doing that, but it
> gets tricky when using non-English script. If you're on a linux OS, then
> *pdftoppm* is a good command line tool to use.
>
> Sample command: pdftoppm -rx 200 -ry 200 -png b.pdf b
> (200 sets DPI.. I found this to be best with the docs I was doing)

--
Dr. Raphael Susewind | Political anthropologist, Associate CSASP Oxford
Snail Mail | Melanchthonstr. 4a, 33615 Bielefeld, Germany
Web & Twitter | http://www.raphael-susewind.de | @RaphaelSusewind
Please do consider http://www.gnupg.org for encryption (key id 10AEE42F)

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
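
For anyone wanting to try the "feed the cell coordinates into Ghostscript" step, here is a rough sketch of how one cell could be cropped out as a PNG. It is only an illustration under assumptions, not the workflow actually used for the electoral rolls: the helper name gs_crop_command is made up, and it assumes you have already converted a cell's bounding box into PDF points (1/72 inch) with the origin at the bottom-left of the page, which is not necessarily how pdf-table-extract reports it. The gs options used (-g for the output size in pixels, PageOffset to shift the page) are standard, but the sign of the vertical shift is worth checking on a test page.

```python
def gs_crop_command(pdf_path, page, box_pt, out_png, dpi=300):
    """Build a Ghostscript command that renders one rectangle of one page.

    box_pt is (x0, y0, x1, y1) in PDF points, origin at the bottom-left of
    the page. ASSUMPTION: the caller has already converted whatever
    coordinates the table extractor produced into this convention.
    """
    x0, y0, x1, y1 = box_pt
    # Output canvas size in pixels at the requested resolution.
    width_px = max(1, round((x1 - x0) * dpi / 72))
    height_px = max(1, round((y1 - y0) * dpi / 72))
    # Shift the page so the box's lower-left corner lands on the canvas origin.
    offset = f"<</PageOffset [{0 - x0:g} {0 - y0:g}]>> setpagedevice"
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        f"-dFirstPage={page}", f"-dLastPage={page}",
        f"-g{width_px}x{height_px}",   # fixed canvas, so only the box is kept
        f"-sOutputFile={out_png}",
        "-c", offset,
        "-f", pdf_path,
    ]


# Example: a 2 x 0.5 inch cell whose lower-left corner is 1 inch from the
# left edge and 2 inches from the bottom edge of page 3.
cmd = gs_crop_command("roll.pdf", 3, (72, 144, 216, 180), "cell.png")

# To actually run it (requires Ghostscript on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than one shell string avoids quoting problems with the PostScript snippet passed via -c.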