I am pleased to announce an Open Source project (AMI2) based partly on PDFBox which aims to create semantic documents from PDFs. It's in three parts of which the first (PDF2SVG) uses a lot of the functionality of PDFBox to finally create SVG. The result is a flat document with no further reliance on the original PDF structure and dictionaries.
First, I'd like to thank members of this list for their help and congratulate the developers on a fine product. The overview is at: https://bitbucket.org/petermr/pdf2svg/overview. I also blog on this (look for ami2 in the title, e.g. https://blogs.ch.cam.ac.uk/pmr/2012/11/16/ami2-opencontentmining-ami-analyses-more-pdfs-and-gets-useful-help-from-stackoverflow-and-shapecatcher/ ; other posts may or may not be of interest).

In essence PDF2SVG tries to:

* normalize all x,y coordinates to a display page/screen
* identify all characters with their x,y position and Unicode codepoint. These are converted to <svg:text x="" y="">text</svg:text>
* identify all paths and convert them to <svg:path d="M d d L d d C d d d ..."/> (i.e. move/line/cubic/quad/close)
* extract bitmaps
* carry out some (but not total) character equivalencing where glyphs are essentially interchangeable, and expand ligatures.

The aim is to be able to turn STM (Scientific, Technical, Medical) documents into semantic objects. These documents are widely found in scholarly publications, reports and patents.

The two subsequent modules do not directly use PDFBox but may be of interest:

* AMI2-SVGPlus converts isolated characters (the output of PDF2SVG) into running text, with super- and subscripts, and paths into higher-order primitives (svg:rect, svg:circle, svg:polyline, etc.). It includes a general tool for extracting vectors into graphical plots (e.g. x-y plots with curves and points)
* AMI2-SVG2XML converts the results of SVGPlus into scientific objects such as chemical reactions, phylogenetic trees, genomes, etc.

These last two have been written and are being refactored.

The main problem we face (and which will be of interest to PDFBoxers) is the extraction of reliable Unicode codepoints. In favourable cases the PDF document uses PDF-approved fonts (e.g. Helvetica) and Unicode codepoints (BTW I think all science and maths can be done with Unicode).
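
To make the target output concrete, here is a minimal Python sketch of the per-character and per-path conversion listed above. It is illustrative only and is not AMI2 or PDFBox code: the function names (expand_ligature, char_to_svg_text, segments_to_svg_path), the (command, coordinates) segment representation, and the choice to expand only the Latin ligatures U+FB00..U+FB06 are all assumptions made for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def expand_ligature(ch: str) -> str:
    """Expand Latin presentation-form ligatures (U+FB00..U+FB06) to plain letters.

    Assumption: only this block is expanded, so that other compatibility
    characters (e.g. superscript digits) are left untouched.
    """
    if len(ch) == 1 and "\ufb00" <= ch <= "\ufb06":
        return unicodedata.normalize("NFKD", ch)
    return ch


def char_to_svg_text(x: float, y: float, ch: str) -> ET.Element:
    """Build an svg:text element for one character at page position (x, y)."""
    elem = ET.Element(f"{{{SVG_NS}}}text", {"x": f"{x:g}", "y": f"{y:g}"})
    elem.text = expand_ligature(ch)
    return elem


def segments_to_svg_path(segments) -> ET.Element:
    """Build an svg:path from (command, coords) pairs, e.g. ("M", [10, 20])."""
    parts = []
    for command, coords in segments:
        parts.append(" ".join([command] + [f"{c:g}" for c in coords]))
    return ET.Element(f"{{{SVG_NS}}}path", {"d": " ".join(parts)})


if __name__ == "__main__":
    ET.register_namespace("svg", SVG_NS)
    text = char_to_svg_text(72.5, 100, "\ufb01")  # the "fi" ligature
    path = segments_to_svg_path([("M", [10, 20]), ("L", [30, 40]), ("Z", [])])
    print(ET.tostring(text, encoding="unicode"))
    print(ET.tostring(path, encoding="unicode"))
```

Restricting the expansion to one Unicode block is a deliberate simplification; a blanket compatibility normalization would also flatten characters that matter in scientific text.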

Unfortunately many of the typesetters use non-standard approaches, and these include:

* sets such as Mathematical-Pi which have no public mapping to Unicode (see my recent question: http://stackoverflow.com/questions/13188587/conversion-of-mathematicalpi-symbol-names-to-unicode). There appear to be two other main ones: Symbol, which maps ASCII characters to Greek letters, for example; and one whose symbols are of the form Cd(dd), thus C3 is asterisk and C6 is plus-minus. Any idea on where this came from would be valuable!
* PDFonts without fontDescriptors
* and even PDFonts without fontNames (only basefont).

The naming of some fonts is also obscure (e.g. AdvP4C4E74). I suspect these are specific to various typesetting companies, but some may be generated on the fly. In the worst case we have only the outline glyphs, which we have to translate to Unicode. (This can be done by heuristics but it is not fun; as it's all Open it might be crowdsourceable.)

So all in all it can be difficult to interpret characters and there can be ambiguity. (Would I be right in thinking that it will be difficult for a machine reader, e.g. for unsighted humans, to understand PDFs which have no FontDescriptor and no FontName?)

This is an Open collaborative project and we'd be delighted for members of this list to use AMI2 and contribute if they wish. We've set up an issue tracker for comments. I am sure some of you will have faced the same problems, and any (even partial) solutions will be useful.

PDF2SVG is beta; the others are being refactored to alpha. PDF2SVG may, of course, be of use in other disciplines, since character processing is configurable through external files.

Enjoy

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

