On Tue, May 7, 2013 at 8:39 AM, Maruan Sahyoun <[email protected]> wrote:
> Hi,
>
> you can either extract to HTML (call Extract Text with the -html option
> for example) or create your own logic. You can take a look at
> org.apache.pdfbox.util.PDFText2HTML as a starting point.
>
> There is also a project to convert PDFtoSVG using PDFBox as a basis which
> might also serve as an example (https://bitbucket.org/petermr/pdftosvg)
>

Thanks very much Maruan, I was about to announce this in a few days' time.
Essentially there are two parts:

1. https://bitbucket.org/petermr/pdf2svg (sic) and the developer version
   https://bitbucket.org/petermr/pdf2svg-dev

These convert the PDFBox output to SVG (Unicode+UTF-8), with particular
emphasis on heuristics for uncommon and undocumented fonts. The main
architecture is stable. The *-dev version is being developed and will
concentrate mainly on codepointSets for fonts. This is an area where
contributions are straightforward, welcomed and incremental. Maths fonts
(esp. CMSY and other CM, Mathematical Pi, etc.) are a finite (if tedious)
task. Later we would like to add machine learning for glyphs where the font
is unknown.

The SVG has three components:
* individual chars: svg:text with x, y, char as well as font, colour, etc.
* paths (svg:path) for lines, circles, boxes, etc.
* svg:image for bitmaps.

2. converting the SVG to HTML and XML
   https://bitbucket.org/petermr/svg2xml-dev (only the developer version so
   far).

This is making good progress. It is highly heuristic, with edge cases, and
has been developed on technical documents, especially academic articles and
theses in STEM subjects. It can make reasonable sense of an average paper,
including analysing pagination and numbering. It separates tables and
figures, and concatenates the main text.

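As a rough illustration of the three SVG components listed above, here is a minimal Java sketch that counts svg:text, svg:path and svg:image elements in an SVG string. It uses only the JDK's XML parser and is not taken from pdf2svg or svg2xml; the class name SvgComponentCount, the method count and the sample SVG (including its coordinates, font name and image file name) are all made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class SvgComponentCount {
    public static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Hypothetical sample (not real pdf2svg output): two positioned
    // characters, one straight-line path and one bitmap reference.
    public static final String SAMPLE =
        "<svg xmlns='http://www.w3.org/2000/svg'"
      + " xmlns:xlink='http://www.w3.org/1999/xlink'>"
      + "<text x='10' y='20' font-family='CMSY10' fill='black'>H</text>"
      + "<text x='18' y='20' font-family='CMSY10' fill='black'>i</text>"
      + "<path d='M10 30 L50 30' stroke='black'/>"
      + "<image x='0' y='40' width='5' height='5' xlink:href='fig1.png'/>"
      + "</svg>";

    // Count how many elements of each of the three component types occur.
    public static Map<String, Integer> count(String svg) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on the SVG namespace
        Document doc = factory.newDocumentBuilder().parse(
            new ByteArrayInputStream(svg.getBytes(StandardCharsets.UTF_8)));

        Map<String, Integer> counts = new TreeMap<>();
        for (String name : new String[] {"text", "path", "image"}) {
            counts.put(name, doc.getElementsByTagNameNS(SVG_NS, name).getLength());
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count(SAMPLE));
    }
}
```

A later stage such as the SVG-to-HTML conversion would presumably start from a split like this (characters, paths, bitmaps) before applying its heuristics, but that is an assumption about the design and is not confirmed here.
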
Prototyped:
* extraction of rectangular tables
* reconstruction of vector graphic objects in figures (where there is
  embedded PS)
* sub- and superscripts and very simple "formulae"
* sectioning

Most of the remaining work will be edge cases (e.g. bad tables).

Not yet attempted:
* footnotes
* inline lists, including bibliography
* deconstruction of bitmapped graphics (text and paths)
* title pages (we are optimistic that analysis of several articles will
  work)

It's at developer level - we'd love to have your input - it's a communal
Open Source project. It is not ready for general use, especially where people
don't understand that there is an inevitable error rate (although small).
We'd be delighted to hear from anyone needing this, but at present you need
to be able to understand running Java and accept the rough edges.

Best

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

