On Fri, Oct 10, 2014 at 2:33 PM, Maruan Sahyoun <[email protected]> wrote:
> Hi Marc,
>
> text and image extraction is one of the regular use cases. Keeping the
> formatting is also possible but there is a different concept behind the PDF
> format and text processing. E.g. what is a paragraph within a text
> processor might be individually placed characters (glyphs) within a PDF
> file. You might want to look into PDFStreamEngine and its subclasses to see
> how to process graphics and text information of a PDF.
>
> Another sample is PDF2SVG which uses PDFBox [
> https://bitbucket.org/petermr/pdf2svg/wiki/Home]
>

Thanks for the link. See also http://www.contentmine.org

The PDF2SVG project is active and is the first part of a pipeline which
includes:

PDF
 -> (SVG, PNG)
 -> (SVG, XHTML, PNG)
 -> (SVG, XHTML, SVG) (where bitmaps have been converted to SVG)
 -> (Shapes, Text)
 -> Semantic Documents
 -> Science

We are now able to take (most) PDFs and extract primitives which are
heuristically combined to create Characters and Paths, which are in turn
combined into Shapes and Text. This is structured into XHTML, along with
sub/superscripts and styling (italics). In favourable cases we can extract
semantic science (currently evolutionary trees from pixel diagrams in PDFs,
and chemical reactions also from pixels in PDFs).

We have to do a significant amount of OCR because (a) diagrams have
characters in pixels and (b) scientific publishers use the worst-ever
non-compliant fonts in their PDFs. This means we have to guess the
character / codePoint from the outline glyph or pixel map.

Some of this is good beta, some is raw alpha. We'd be delighted if anyone is
interested in hacking pixels or glyph outlines in PDFs - it's painful but you
get a warm glow of having helped the human race. Same goes for tables and
document structuring...

BR

P

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

