I am very involved in trying to extract meaning from PDFs and add some
comments. Everything Eliot says is correct. The problem is hard and very
dependent on the source. An airline boarding pass has a very different
layout from a thesis. But I am optimistic that a limited solution can be
found in the academic/scholarly-publishing domain (which I think covers
Thomas's concern). I am also very interested in creating an open-source
community to tackle this problem, using PDFBox as the initial tool. So
although it's "me" up to now, I'd like it to be "us".

The problem falls into at least 3 modules:

* what are the primitives in the document (characters, paths, images)?
 I am tackling this in http://www.bitbucket.org/petermr/pdf2svg. It is an
essential first step. I have analysed about 2000 PDFs from bioscience
articles, from about 400 different publishers, and found the following:
  - there is a very large variety of fonts, usually undocumented. I have
built ad hoc conversion tables for nearly 100 of these. The tables are not
complete but cover the commonest characters.
  - a small proportion of articles (< 1%) have problem characters. Some of
these can be recovered with heuristics.
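The conversion tables mentioned above can be thought of as per-font lookups
from glyph names to Unicode. A minimal sketch (the font and glyph names below
are illustrative, not the actual pdf2svg tables):

```python
# Sketch of a per-font glyph-name -> Unicode conversion table, of the kind
# needed when a PDF font has a nonstandard or undocumented encoding.
# All entries here are illustrative examples, not the real pdf2svg data.

CONVERSION_TABLES = {
    "AdvP4C4E74": {            # hypothetical publisher font name
        "C6": "\u03b1",        # Greek small letter alpha
        "C14": "\u2192",       # rightwards arrow
    },
    "MTSY": {                  # hypothetical maths-symbol font
        "H11001": "+",
        "H11002": "\u2212",    # minus sign
    },
}

def convert_glyph(font_name, glyph_name):
    """Look up a glyph name in the per-font table; None if unknown."""
    table = CONVERSION_TABLES.get(font_name, {})
    return table.get(glyph_name)
```

Unknown glyphs return None, which is where the heuristic recovery of "problem
characters" would take over.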

I have tried two examples from Goettingen:
http://www.univerlag.uni-goettingen.de/content/list.php?notback=1&details=isbn-978-3-86395-092-7-
(36 MB of astrophysics) and
http://www.gojil.org/ (Goettingen's Journal of International Law). If you
have other Open eBooks, let me know. Both convert without real problems.
The first has some very large pages (ca 20 MB of SVG, because they are very
rich diagrams and contain about 10000 points for stars in a single
diagram). This is part of my motivation: I can often extract data from PDFs.

This is a reasonably running system.

* Can we create a structured document?
 I am tackling your problem in http://bitbucket.org/petermr/svgplus - in
progress. Here the method may depend on the domain. Law and astrophysics
use different layouts - for example law has references as footnotes on each
page while astro has a separate Bibliography. Some domains have references
and comments as floats. My comments from here on relate to science.

I use the following strategy (which is not novel):
- break the page into whitespace-separated chunks. I normally do 3 passes
(horizontal, vertical, horizontal). This creates a tree of chunks which
manages most things found in scientific PDFs.
- analyse super- and sub-scripts (in PDF these are simply smaller
characters displaced from the running text).
- join lines by heuristics.
- split paragraphs by heuristics.
(These last two are helped by indents, changes of style, short sentences,
etc. There is no golden rule.)
- identify figures and tables.
- glue the running text together (it usually has a single font).
- glue the pages together.
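The first step above is essentially a recursive whitespace (XY-cut style)
split. A minimal sketch, assuming elements arrive as (x0, y0, x1, y1)
bounding boxes and using an illustrative minimum-gap threshold (this is not
the svgplus code itself):

```python
# Sketch of whitespace-based page chunking: split a set of element bounding
# boxes along the widest whitespace gap, alternating axes, building a tree
# of chunks. Boxes are (x0, y0, x1, y1); axis 0 = x, axis 1 = y.

def largest_gap(boxes, axis):
    """Widest whitespace gap between boxes along an axis.
    Returns (gap_size, split_position), or None if no gap exists."""
    intervals = sorted((b[axis], b[axis + 2]) for b in boxes)
    best = None
    reach = intervals[0][1]          # furthest extent covered so far
    for lo, hi in intervals[1:]:
        if lo > reach:               # genuine whitespace between reach and lo
            gap = lo - reach
            if best is None or gap > best[0]:
                best = (gap, (reach + lo) / 2)
        reach = max(reach, hi)
    return best

def chunk(boxes, axes=(1, 0, 1), min_gap=5.0):
    """Recursively split boxes along the given sequence of axes.
    The default (1, 0, 1) mimics the 3 passes: horizontal whitespace bands
    (split on y), then vertical (split on x), then horizontal again."""
    if not axes or len(boxes) < 2:
        return boxes
    axis = axes[0]
    best = largest_gap(boxes, axis)
    if best is None or best[0] < min_gap:
        return chunk(boxes, axes[1:], min_gap)   # no usable gap: try next axis
    pos = best[1]
    left = [b for b in boxes if b[axis + 2] <= pos]
    right = [b for b in boxes if b[axis] >= pos]
    return [chunk(left, axes[1:], min_gap), chunk(right, axes[1:], min_gap)]
```

The nesting of the returned lists is the "tree of chunks": each recursive
split becomes a branch, and leaves are groups of boxes with no remaining
whitespace gap.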

I have been through this in prototype but am now refactoring, so not much
currently works.
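One step above - detecting super- and sub-scripts as smaller characters
displaced from the running text - can be sketched as a simple size-and-offset
test. The thresholds here are illustrative assumptions, not tuned values:

```python
# Sketch of the super/subscript heuristic: a character is a candidate
# super- or subscript if it is noticeably smaller than the running text
# and its baseline is displaced from the line's baseline. Coordinates
# follow the SVG convention (y increases downwards).
# The 0.8 and 0.25 thresholds are illustrative, not the svgplus values.

def classify_script(char_size, char_y, line_size, line_y,
                    size_ratio=0.8, shift_ratio=0.25):
    """Return 'super', 'sub', or 'normal' for a character vs. its line."""
    if char_size < size_ratio * line_size:
        shift = line_y - char_y          # positive = raised above baseline
        if shift > shift_ratio * line_size:
            return "super"
        if shift < -shift_ratio * line_size:
            return "sub"
    return "normal"
```

A small raised character is classed as a superscript, a small lowered one as
a subscript; full-size characters stay in the running text regardless of
position.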

* identify the "meaning". We now need to identify what specific chunks
"mean". A subscript might be part of a chemical formula or an index in
maths. Italics might indicate a species. Data can be extracted from
diagrams and tables. We can interpret chemical structure diagrams,
phylogenetic trees and xy-plots and there is no reason why this couldn't
extend to electronic circuits, flow diagrams, organization charts, etc. I
think domain-based crowdsourcing will be valuable here.

Overall it is a very hard, tedious problem with an inevitable small
percentage of errors. But the rewards are very large.

P.

Note, of course, that creating PDF from Word, LaTeX, etc. usually
*destroys* information. I wish university libraries didn't do it. But
that's out of scope here...

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
