Hi Nikhil and all,

I had the best results with a Python tool called pdf-table-extract:

https://github.com/ashima/pdf-table-extract

You have to tweak the parameters a bit, but it then extracts the
coordinates of each cell (defined as something surrounded by a black
rectangle) rather nicely. You can feed these into Ghostscript or a
similar tool to extract the image (in my experience gs is faster than
pdftoppm). In most cases the following worked nicely for electoral
rolls:

  pdf-table-extract -i FILE -p PAGE -r 300 -l 0.7 -t cells_xml
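
To illustrate the cropping step, here is a rough Python sketch that turns one
cell's bounding box into a crop command. It is only a sketch under several
assumptions: the CellBox type and both function names are made up for this
example, the box is assumed to be in pixels at the render resolution (this is
not the actual cells_xml layout, which you would have to parse and convert
yourself), and it uses pdftoppm's crop options (-x, -y, -W, -H) instead of
Ghostscript simply because they are easier to show.

```python
import subprocess
from typing import List, NamedTuple


class CellBox(NamedTuple):
    """Hypothetical cell bounding box, in pixels at the render resolution."""
    page: int    # 1-based page number
    x: int       # left edge of the cell
    y: int       # top edge of the cell
    width: int
    height: int


def pdftoppm_crop_command(pdf_path: str, box: CellBox,
                          out_prefix: str, dpi: int = 300) -> List[str]:
    """Build a pdftoppm call that renders only the given cell as a PNG."""
    return [
        "pdftoppm",
        "-r", str(dpi),                            # render resolution in DPI
        "-f", str(box.page), "-l", str(box.page),  # restrict to one page
        "-x", str(box.x), "-y", str(box.y),        # top-left corner of crop
        "-W", str(box.width), "-H", str(box.height),
        "-png",
        pdf_path, out_prefix,
    ]


def crop_cell(pdf_path: str, box: CellBox, out_prefix: str,
              dpi: int = 300) -> None:
    """Run the crop command; requires pdftoppm (poppler-utils) on the PATH."""
    subprocess.run(pdftoppm_crop_command(pdf_path, box, out_prefix, dpi),
                   check=True)
```

For example, crop_cell("roll.pdf", CellBox(3, 120, 450, 600, 80), "cell")
would render that one region of page 3 at 300 DPI. The DPI here has to match
the resolution at which the cell coordinates were measured.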

Just my 5 cents,

Raphael

On 29.09.2015 21:19, Nikhil VJ wrote:

> Hi Raphael,
> 
> Thanks for sharing about Tesseract: it always helps to know what's in
> the engines ~:)
> 
> I wish we had a way of OCR'ing tabular documents: Tabula's interface
> combined with OCR.
> I created a feature request on Tabula for this:
> https://github.com/tabulapdf/tabula/issues/409
> Let's hope it gets some love! Please +1 it!
> 
> Siddharth, you should share at least a one-page PDF sample of what
> you're working with, so that we can see which approach is best for
> what you've got.
> 
> If one goes the OCR way, we might need to convert the target PDF to
> an image format. There are quite a few online sites for doing that, but
> it gets tricky with non-English scripts. If you're on Linux, then
> *pdftoppm* is a good command-line tool to use.
> 
> Sample command: pdftoppm -rx 200 -ry 200 -png b.pdf b
> (200 sets the DPI; I found this worked best with the docs I was doing)

-- 
Dr. Raphael Susewind | Political anthropologist, Associate CSASP Oxford
          Snail Mail | Melanchthonstr. 4a, 33615 Bielefeld, Germany
       Web & Twitter | http://www.raphael-susewind.de | @RaphaelSusewind

Please do consider http://www.gnupg.org for encryption (key id 10AEE42F)
