Hi,

I have used pdftk (the PDF Toolkit) before. A quick glance at its feature list suggests that it does *not* support what you are looking for, but it may nonetheless be a useful starting point: http://www.accesspdf.com/pdftk/ . The nice thing is that it's a command-line tool.

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the face of ambiguity, refuse the temptation to guess.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- On Sat, 1/9/10, Barry Rowlingson <b.rowling...@lancaster.ac.uk> wrote:

From: Barry Rowlingson <b.rowling...@lancaster.ac.uk>
Subject: Re: [R] parsing pdf files
To: "David Kane" <d...@kanecap.com>
Cc: r-help@r-project.org
Date: Saturday, January 9, 2010, 2:47 PM

On Sat, Jan 9, 2010 at 1:11 PM, David Kane <d...@kanecap.com> wrote:
> I have a pdf file that I would like to parse into R:
>
> http://www.williams.edu/Registrar/geninfo/faculty.pdf
>
> For now, I open the file in Acrobat by hand, then save it "as text"
> and then use readLines(). That works fine but a) I am concerned that
> some information may be lost and b) I may be doing this a lot, so I
> would rather have R grab the information from the pdf file directly.
>
> So: is there something like readPDF() for R?

What could it do that saving as text from Acrobat couldn't do?

Here's the problem - PDF is a page description format, it's not designed to be read back. There's no guarantee that the letters on the page appear in the PDF in the same order as they seem on the page. The page could have all the letter 'a's, then the 'b's and so on, positioned in their right places to make up words. To reconstruct the words you'd have to spot where the letters were being placed, and then figure out the breaks and make up the words. Good luck making the sentences.

Most PDFs aren't that perverse, and you can often get sensible text out of them. But then you run into font encodings and graphics and column layouts and stuff. Any effort put into writing a readPDF() would have to be redone every time someone tried to read a PDF :)

On Linux/Unix there's a bunch of command line tools for trying to do this kind of thing with PDF files - see pdftotext for example.
You could run that from R with system() and then read the text with readLines(). But there are absolutely no guarantees this will work. Windows/Mac versions (did you say what your platform was?) of the command line tools may be available.

The real answer is to get the original data in a format with some kind of semantics that R could read, for example a CSV or some nice XML format.

Barry

--
blog: http://geospaced.blogspot.com/
web: http://www.maths.lancs.ac.uk/~rowlings
web: http://www.rowlingson.com/
twitter: http://twitter.com/geospacedman
pics: http://www.flickr.com/photos/spacedman

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
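
For reference, Barry's suggestion of running pdftotext from R with system() and then reading the result with readLines() might look roughly like the sketch below. It assumes that the pdftotext command-line tool is installed and on the PATH; the function names pdftotext_command() and read_pdf_text() are made up for illustration and are not part of any package. As noted above, there is no guarantee that the extracted text comes out in a sensible order for any given PDF.

```r
# Build the shell command that converts a PDF to plain text.
# "-layout" asks pdftotext to try to preserve the physical page layout.
# (Hypothetical helper name, for illustration only.)
pdftotext_command <- function(pdf_file, txt_file) {
  paste("pdftotext", "-layout", shQuote(pdf_file), shQuote(txt_file))
}

# Convert a PDF with the external pdftotext tool and return its text
# as a character vector, one element per line.
# (Hypothetical helper name, for illustration only.)
read_pdf_text <- function(pdf_file) {
  # Assumption: pdftotext is installed and can be found on the PATH.
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }
  txt_file <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_file))  # remove the temporary text file afterwards

  status <- system(pdftotext_command(pdf_file, txt_file))
  if (status != 0) {
    stop("pdftotext failed with exit status ", status)
  }
  readLines(txt_file, warn = FALSE)
}
```

With a local copy of the file, usage would be something like lines <- read_pdf_text("faculty.pdf"), after which the usual text-processing tools in R apply.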