Very cool project! I did not see a license file or EULA declaring the GPL or a similar license. What license are you using? I would like to introduce this work to some people.
Thank you for sharing!

Duane Nickull
***********************************
Technoracle Advanced Systems Inc.
Consulting and Contracting; Proven Results!
i. Neo4J, PDF, Java, LiveCycle ES, Flex, AIR, CQ5 & Mobile
b. http://technoracle.blogspot.com
t. @duanechaos
"Don't fear the Graph! Embrace Neo4J"

On 2012-11-16 4:42 PM, "Peter Murray-Rust" <[email protected]> wrote:

>I am pleased to announce an Open Source project (AMI2) based partly on
>PDFBox which aims to create semantic documents from PDFs. It's in three
>parts, of which the first (PDF2SVG) uses a lot of the functionality of
>PDFBox to finally create SVG. The result is a flat document with no
>further reliance on the original PDF structure and dictionaries.
>
>First to thank members of this list for help and congratulate the
>developers on a fine product.
>
>The overview is at:
>https://bitbucket.org/petermr/pdf2svg/overview
>I also blog on this (look for ami2 in the title, e.g.
>https://blogs.ch.cam.ac.uk/pmr/2012/11/16/ami2-opencontentmining-ami-analyses-more-pdfs-and-gets-useful-help-from-stackoverflow-and-shapecatcher/
>; other blogs may or may not be of interest).
>
>In essence PDF2SVG tries to:
>* normalize all x,y coordinates to a display page/screen
>* identify all characters with x,y and Unicode codepoint. These are
>  converted to <svg:text x="" y="">text</svg:text>
>* identify all paths and convert to <svg:path d="M d d L d d C d d d ..."/>
>  (i.e. move/line/cubic/quad/close)
>* extract bitmaps.
>* carry out some (but not total) character equivalencing where glyphs
>  are essentially interchangeable. Also expand ligatures.
>
>The aim is to be able to turn STM documents (Scientific Technical
>Medical) into semantic objects. These documents are widely found in
>scholarly publications, reports and patents. The two subsequent modules
>do not directly use PDFBox but may be of interest:
>* AMI2-SVGPlus converts isolated characters (output of PDF2SVG) into
>  running text, with super- and subscripts, and paths into higher-order
>  primitives (svg:rect, svg:circle, svg:polyline, etc.). It includes a
>  general tool for extracting vectors into graphical plots (e.g. x-y
>  plots with curves and points)
>* AMI2-SVG2XML converts the results of SVGPlus into scientific objects
>  such as chemical reactions, phylogenetic trees, genome, etc.
>These last two have been written and are being refactored.
>
>The main problem we face (and which will be of interest to PDFBoxers)
>is the extraction of reliable Unicode codepoints. In favourable cases
>the PDF document uses PDF-approved fonts (e.g. Helvetica) and Unicode
>points (BTW I think all science and maths can be done with Unicode).
>Unfortunately many of the typesetters use non-standard approaches and
>these include:
>* sets such as Mathematical-Pi which have no public mapping to Unicode
>  (see my recent question:
>  http://stackoverflow.com/questions/13188587/conversion-of-mathematicalpi-symbol-names-to-unicode).
>  There appear to be 2 main others (Symbol, which maps ASCII characters
>  to Greek letters, for example; and one whose symbols are of the form
>  Cd(dd) - thus C3 is asterisk and C6 is plus-minus). Any idea on where
>  this came from would be valuable!
>* PDFonts without fontDescriptors
>* and even PDFonts without fontNames (only basefont).
>
>The naming of some fonts is also obscure (e.g. AdvP4C4E74). I suspect
>these are specific to various typesetting companies but some may be
>generated on the fly. In the worst case we have only the outline glyphs
>which we have to translate to Unicode. (This can be done by heuristics
>but it is not fun - as it's all Open it might be crowdsourceable.) So
>all-in-all it can be difficult to interpret characters and there can be
>ambiguity. (Would I be right in thinking that it will be difficult for
>a machine reader - e.g. for unsighted humans - to understand PDFs which
>had no FontDescriptor and no FontName?)
>
>This is an Open collaborative project and we'd be delighted for members
>of this list to use AMI2 and contribute if they wish. We've set up an
>issue tracker for comments. I am sure some of you will have faced the
>same problems and any (even partial) solutions will be useful.
>
>PDF2SVG is beta; the others are being refactored to alpha.
>
>PDF2SVG may, of course, be of use in other disciplines - character
>processing is configurable through external files.
>
>Enjoy
>--
>Peter Murray-Rust
>Reader in Molecular Informatics
>Unilever Centre, Dep. Of Chemistry
>University of Cambridge
>CB2 1EW, UK
>+44-1223-763069
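
For readers who want a concrete picture of the character and path steps listed in the announcement, here is a minimal Python sketch. It is not code from PDF2SVG (which is built on PDFBox in Java), and every name in it (to_unicode, text_element, path_element, SYMBOL_TO_UNICODE) is hypothetical. It assumes that ligature expansion can be approximated with Unicode NFKC normalization, which the announcement does not say, and it includes only a handful of Symbol-font entries for illustration rather than a complete mapping.

```python
import unicodedata
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("svg", SVG_NS)  # serialize elements as <svg:text>, <svg:path>

# Partial, illustrative table only: a few ASCII characters that the
# Symbol font displays as Greek letters. A real table would be much larger.
SYMBOL_TO_UNICODE = {
    "a": "\u03b1",  # alpha
    "b": "\u03b2",  # beta
    "g": "\u03b3",  # gamma
    "d": "\u03b4",  # delta
    "p": "\u03c0",  # pi
}


def to_unicode(char, font_name):
    """Map a raw character to Unicode text (hypothetical helper)."""
    if font_name == "Symbol":
        # Unknown Symbol codes are passed through unchanged in this sketch.
        return SYMBOL_TO_UNICODE.get(char, char)
    # Assumption: NFKC stands in for ligature expansion and character
    # equivalencing, e.g. the single "fi" ligature becomes "f" + "i".
    return unicodedata.normalize("NFKC", char)


def text_element(char, x, y, font_name="Helvetica"):
    """Build <svg:text x=".." y="..">char</svg:text> for one positioned character."""
    el = ET.Element(f"{{{SVG_NS}}}text", {"x": f"{x:g}", "y": f"{y:g}"})
    el.text = to_unicode(char, font_name)
    return el


def path_element(segments):
    """Build <svg:path d="..."/> from (operator, coordinates) pairs."""
    parts = []
    for op, coords in segments:
        parts.append(op)
        parts.extend(f"{v:g}" for v in coords)
    return ET.Element(f"{{{SVG_NS}}}path", {"d": " ".join(parts)})


if __name__ == "__main__":
    glyph = text_element("\ufb01", 72, 100)  # the "fi" ligature
    outline = path_element([("M", (0, 0)), ("L", (10, 0)), ("C", (10, 5, 5, 10, 0, 10)), ("Z", ())])
    print(ET.tostring(glyph, encoding="unicode"))
    print(ET.tostring(outline, encoding="unicode"))
```

Note that NFKC also folds other compatibility characters (superscript digits, for example), which may be unwanted where super- and subscripts carry meaning, so a curated equivalence table in an external file, as the announcement describes, is probably the safer design.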

