ANN: AMI2-PDF2SVG conversion of PDF to semantic characters and graphics

Peter Murray-Rust Fri, 16 Nov 2012 16:43:28 -0800

I am pleased to announce an Open Source project (AMI2) based partly on
PDFBox which aims to create semantic documents from PDFs. It's in three
parts of which the first (PDF2SVG) uses a lot of the functionality of
PDFBox to finally create SVG. The result is a flat document with no further
reliance on the original PDF structure and dictionaries.

First, I would like to thank members of this list for their help and
congratulate the developers on a fine product.

The overview is at: https://bitbucket.org/petermr/pdf2svg/overview. I also
blog on this (look for ami2 in the title; other posts may or may not be of
interest), e.g.
https://blogs.ch.cam.ac.uk/pmr/2012/11/16/ami2-opencontentmining-ami-analyses-more-pdfs-and-gets-useful-help-from-stackoverflow-and-shapecatcher/

In essence PDF2SVG tries to:
* normalize all x,y coordinates to a display page/screen
* identify all characters with x,y and Unicode codepoint. These are
converted to <svg:text x="" y="">text</svg:text>
* identify all paths and convert to <svg:path d="M d d L d d C d d d ..."/>
(i.e. move/line/cubic/quad/close)
* extract bitmaps.
* carry out some (but not total) character equivalencing where glyphs are
essentially interchangeable. Also expand ligatures.
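
The steps listed above can be sketched in a few lines. This is an
illustration in Python, not AMI2 code (AMI2 itself builds on PDFBox). It
assumes the usual PDF default of a bottom-left origin, so y is flipped
against a page height to get screen coordinates. The function names, the
segment format and the small ligature table are all hypothetical.

```python
from xml.sax.saxutils import escape

# Assumed ligature table; a real tool would read this from a config file.
LIGATURES = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
}

def fmt(value):
    """Format a coordinate with at most 3 decimals and no trailing zeros."""
    return ("%.3f" % value).rstrip("0").rstrip(".")

def flip_y(y, page_height):
    """Convert a bottom-left-origin y (assumed PDF default) to top-left."""
    return page_height - y

def expand_ligatures(text):
    """Replace each known ligature codepoint with its component letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)

def svg_text(char, x, y, page_height):
    """Emit one positioned character as an svg:text element."""
    return '<svg:text x="%s" y="%s">%s</svg:text>' % (
        fmt(x), fmt(flip_y(y, page_height)), escape(expand_ligatures(char)))

def svg_path(segments, page_height):
    """Emit an svg:path from (op, [(x, y), ...]) pairs; op is M, L, C, Q or Z."""
    parts = []
    for op, points in segments:
        coords = " ".join(
            "%s %s" % (fmt(x), fmt(flip_y(y, page_height))) for x, y in points)
        parts.append((op + " " + coords).strip())
    return '<svg:path d="%s"/>' % " ".join(parts)
```

For example, svg_text("A", 10, 700, 792) places the character at x=10,
y=92 on a 792-unit-high page.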

The aim is to be able to turn STM (Scientific, Technical, Medical) documents
into semantic objects. These documents are widely found in scholarly
publications, reports and patents. The two subsequent modules do not
directly use PDFBox but may be of interest:
* AMI2-SVGPlus converts isolated characters (output of PDF2SVG) into
running text, with super- and subscripts, and paths into higher order
primitives (svg:rect, svg:circle, svg:polyline, etc.). It includes a general
tool for extracting vectors into graphical plots (e.g. x-y plots with
curves and points).
* AMI2-SVG2XML converts the results of SVGPlus into scientific objects such
as chemical reactions, phylogenetic trees, genome, etc.
These last two have been written and are being refactored.
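
As an illustration of what "paths into higher order primitives" can mean,
here is a minimal Python sketch that recognises an axis-aligned rectangle
from the vertices of a closed path. It is not the SVGPlus implementation;
the function name, the tolerance and the return format are assumptions.

```python
def as_rect(points, tol=1e-3):
    """Return (x, y, width, height) if the vertices form an axis-aligned
    rectangle, otherwise None. A repeated closing vertex is tolerated."""
    pts = list(points)
    if len(pts) == 5 and all(abs(a - b) <= tol for a, b in zip(pts[0], pts[4])):
        pts = pts[:4]
    if len(pts) != 4:
        return None
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    if x1 - x0 <= tol or y1 - y0 <= tol:
        return None  # degenerate: a line or a point
    corners = []
    for x, y in pts:
        # Snap each vertex to a corner of the bounding box, if it is on one.
        cx = 0 if abs(x - x0) <= tol else 1 if abs(x - x1) <= tol else None
        cy = 0 if abs(y - y0) <= tol else 1 if abs(y - y1) <= tol else None
        if cx is None or cy is None:
            return None
        corners.append((cx, cy))
    if len(set(corners)) != 4:
        return None
    for i in range(4):
        a, b = corners[i], corners[(i + 1) % 4]
        if a[0] != b[0] and a[1] != b[1]:
            return None  # diagonal edge: a bow-tie, not a rectangle
    return (x0, y0, x1 - x0, y1 - y0)
```

A path M 0 0 L 10 0 L 10 5 L 0 5 Z would give (0, 0, 10, 5), which can be
written out as an svg:rect; anything else is left as a path.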

The main problem we face (and which will be of interest to PDFBoxers) is
the extraction of reliable Unicode codepoints. In favourable cases the PDF
document uses PDF-approved fonts (e.g. Helvetica) and Unicode points (BTW I
think all science and maths can be done with Unicode). Unfortunately many
of the typesetters use non-standard approaches and these include:
* sets such as Mathematical-Pi which have no public mapping to Unicode (see
my recent question:
http://stackoverflow.com/questions/13188587/conversion-of-mathematicalpi-symbol-names-to-unicode).
There appear to be 2 main others: Symbol, which maps ASCII characters to
Greek letters, for example; and one whose symbols are of the form Cd(dd),
thus C3 is asterisk and C6 is plus-minus. Any idea on where this came from
would be valuable!
* PDFonts without fontDescriptors
* and even PDFonts without fontNames (only basefont).
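
One way to handle such sets is a per-font lookup table that flags anything
it cannot resolve. The sketch below is Python and purely illustrative: the
table layout, the function name and the label "UnknownCdSet" are
hypothetical, the Symbol entries are a small sample, and the C3/C6 entries
are the two examples given above.

```python
# Per-font tables from font-specific codes to Unicode (sample entries only).
FONT_TABLES = {
    "Symbol": {
        "a": "\u03b1",  # alpha
        "b": "\u03b2",  # beta
        "g": "\u03b3",  # gamma
        "D": "\u0394",  # capital Delta
    },
    # Hypothetical label for the set whose origin is unknown.
    "UnknownCdSet": {
        "C3": "*",
        "C6": "\u00b1",  # plus-minus
    },
}

def to_unicode(font_name, code):
    """Return (text, resolved). Unknown fonts pass the code through
    unresolved; unknown codes in a known font become U+FFFD."""
    table = FONT_TABLES.get(font_name)
    if table is None:
        return code, False
    if code in table:
        return table[code], True
    return "\ufffd", False
```

Returning a resolved flag alongside the text keeps the ambiguous characters
visible, so they can be collected and reviewed rather than silently guessed.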

The naming of some fonts is also obscure (e.g. AdvP4C4E74). I suspect these
are specific to various typesetting companies but some may be generated on
the fly. In the worst case we have only the outline glyphs which we have to
translate to Unicode. (This can be done by heuristics but it is not fun; as
it's all Open it might be crowdsourceable.) So all-in-all it can be
difficult to interpret characters and there can be ambiguity. (Would I be
right in thinking that it will be difficult for a machine reader, e.g. for
unsighted humans, to understand PDFs which have no FontDescriptor and no
FontName?)

This is an Open collaborative project and we'd be delighted for members of
this list to use AMI2 and contribute if they wish. We've set up an issue
tracker for comments. I am sure some of you will have faced the same
problems and any (even partial) solutions will be useful.

PDF2SVG is beta; the others are being refactored to alpha.

PDF2SVG may, of course, be of use in other disciplines - character
processing is configurable through external files.

Enjoy
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
