I am very involved in trying to extract meaning from PDFs, and will add some comments. Everything Eliot says is correct. The problem is hard and very dependent on the source: an airline boarding pass has a very different layout from a thesis. But I am optimistic that a limited solution can be found in the academic/scholarly-publishing domain (which I think covers Thomas' concern). I am also very interested in creating an open-source community to tackle this problem, using PDFBox as the initial tool. So although it's "me" up to now, I'd like it to be "us".
The problem falls into at least three modules:

* What are the primitives in the document (characters, paths, images)? I am tackling this in http://www.bitbucket.org/petermr/pdf2svg. It is an essential first step. I have analysed about 2000 PDFs from bioscience articles and about 400 different publishers, and found the following:
  - There is a very large variety of fonts, usually undocumented. I have built ad hoc conversion tables for nearly 100 of these. The tables are not complete but cover the commonest characters.
  - A small proportion of articles (< 1%) have problem characters. Some of these can be recovered with heuristics.
I have tried two examples from Goettingen (http://www.univerlag.uni-goettingen.de/content/list.php?notback=1&details=isbn-978-3-86395-092-7 - 36 MB of astrophysics - and http://www.gojil.org/ - Goettingen's Journal of International Law). If you have other open eBooks, let me know. Both convert without real problems. The first has some very large pages (ca 20 MB of SVG) because they are very rich diagrams; a single diagram contains about 10000 points for stars. This is part of my motivation: I can often extract data from PDFs. This is a reasonably running system.

* Can we create a structured document? I am tackling your problem in http://bitbucket.org/petermr/svgplus - in progress. Here the method may depend on the domain. Law and astrophysics use different layouts - for example, law has references as footnotes on each page while astro has a separate bibliography. Some domains have references and comments as floats. My comments from here on relate to science. I use the following strategy (which is not novel):
  - Break the page into whitespace-separated chunks. I normally do 3 passes (horizontal, vertical, horizontal). This creates a tree of chunks which manages most things found in scientific PDFs.
  - Analyze super- and sub-scripts (in PDF these are simply smaller characters displaced from the running text).
  - Join lines by heuristics.
  - Split paras by heuristics. (These last two are helped by indents, changes of style, short sentences, etc. There is no golden rule.)
  - Identify figures and tables.
  - Glue the running text together (it usually has a single font).
  - Glue the pages together.
I have been through this in prototype but am now refactoring, so not much works at present.

* Identify the "meaning". We now need to identify what specific chunks "mean". A subscript might be part of a chemical formula or an index in maths. Italics might indicate a species. Data can be extracted from diagrams and tables. We can interpret chemical structure diagrams, phylogenetic trees and xy-plots, and there is no reason why this couldn't extend to electronic circuits, flow diagrams, organization charts, etc. I think domain-based crowdsourcing will be valuable here.

Overall it is a very hard, tedious problem with an inevitable small percentage of errors. But the rewards are very large.

P. Note, of course, that creating PDF from Word, LaTeX, etc. usually *destroys* information. I wish university libraries didn't do it. But that's out of scope here...

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
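P.S. For the curious, here is a minimal Python sketch of the first two steps above - whitespace chunking with alternating passes, and the "smaller and displaced" super/subscript test. This is not the actual svgplus code; all names, thresholds and the box representation are my own invention for illustration.

```python
# Sketch of whitespace-based page chunking. A page is a list of glyph/graphic
# bounding boxes (x0, y0, x1, y1); a chunk is a group of boxes separated from
# its neighbours by a whitespace gap. Names and thresholds are illustrative.

def split_on_gaps(boxes, axis, min_gap=5.0):
    """Split boxes into groups wherever a whitespace gap of at least
    min_gap opens up along one axis (axis=1: horizontal cuts on y;
    axis=0: vertical cuts on x)."""
    boxes = sorted(boxes, key=lambda b: b[axis])
    groups, current, reach = [], [], None
    for b in boxes:
        lo, hi = b[axis], b[axis + 2]
        if reach is not None and lo - reach >= min_gap:
            groups.append(current)          # gap found: close current group
            current, reach = [], None
        current.append(b)
        reach = hi if reach is None else max(reach, hi)
    if current:
        groups.append(current)
    return groups

def chunk_page(boxes, passes=(1, 0, 1)):
    """Three alternating passes (horizontal, vertical, horizontal cuts),
    returning the leaf chunks of the resulting tree as a flat list."""
    chunks = [boxes]
    for axis in passes:
        chunks = [g for c in chunks for g in split_on_gaps(c, axis)]
    return chunks

def is_sub_or_superscript(char_size, char_y, line_size, line_y, ratio=0.8):
    """Heuristic: a glyph noticeably smaller than the running text and
    vertically displaced from the baseline is a super- or subscript."""
    return (char_size < ratio * line_size
            and abs(char_y - line_y) > 0.1 * line_size)
```

For example, four boxes laid out as two words side by side, a distant word to the right, and a word further down the page come back as three chunks: the two adjacent words merge, the other two stand alone. The real system builds a tree rather than a flat list, and the gap threshold has to adapt to font size.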

