We do a great deal of this and have created two downstream packages which consume the output of PDFBox:
* https://bitbucket.org/petermr/pdf2svg/ (which translates the PDF into SVG)
* https://bitbucket.org/petermr/svg2xml (which tries to convert the SVG into high-level constructs)

There are roughly three kinds of output from PDFBox that relate to the viewable page (we deliberately ignore all metadata, dictionaries, etc., as they are likely to be inconsistent):

* characters, either through codepoints (often not Unicode, unfortunately) or through pixel-based glyphs
* bitmaps (raster), as Eliot mentions
* graphics paths (move, line, quadratic and cubic Bezier)

It is possible for all of these to occur in the same area. However, in many instances the "text" and the "graphics" are separated by whitespace. (We cannot rely on the order of primitives.) We can then use whitespace heuristics to separate the page into "text", "graphics" and "pixel images". (Note, however, that text could contain small pixel images for characters, and also small paths for underlines, etc.)

Assuming that you have "clean" graphics, such as plots, it is possible with a great deal of work to extract a reasonable guess at the original primitives. (For example, there is no "circle" or "rectangle" in PDF, only paths.) It depends on what your material is, how it was produced, what the primitives are, etc.

You are very welcome to try our software, which is all Apache2 licensed.

On Fri, Mar 20, 2015 at 1:43 PM, Warren Gallagher <[email protected]> wrote:

> Greetings,
>
> Is there a means to determine if a page contains:
>
> * vector graphics
> * raster graphics (and what format)
>
> Regards,
>
> WARREN GALLAGHER - CTO
> [email protected]
> M: 613-791-4987 W: 613-262-2601
> Advance Property eXposure Canada Inc.
> 1755 Woodward Drive, Suite 101, Ottawa, Ontario K2C 0P9
> APXConsult.com [1]
>
> Links:
> ------
> [1] http://apxconsult.com

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
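
As a rough illustration of the whitespace heuristic described in the reply above, here is a minimal Python sketch. It is not taken from pdf2svg or svg2xml (which are Java) and does not call PDFBox. It assumes the primitives have already been extracted as bounding boxes tagged "text", "graphics" or "image"; the `Box` type, the `gap` threshold and the area-based labelling rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of one extracted primitive (hypothetical type)."""
    kind: str  # "text", "graphics" or "image"
    x0: float
    y0: float
    x1: float
    y1: float


def near(a, b, gap):
    """True if the whitespace between a and b is at most `gap` on both axes."""
    return (a.x0 - gap <= b.x1 and b.x0 - gap <= a.x1 and
            a.y0 - gap <= b.y1 and b.y0 - gap <= a.y1)


def regions(boxes, gap=10.0):
    """Group boxes into regions separated by more than `gap` of whitespace.

    The result does not depend on the order of the input, since the
    order of primitives in the PDF cannot be relied on.
    """
    clusters = []
    for box in boxes:
        merged, rest = [box], []
        for cluster in clusters:
            if any(near(box, other, gap) for other in cluster):
                merged.extend(cluster)   # box bridges into this cluster
            else:
                rest.append(cluster)
        clusters = rest + [merged]
    return clusters


def label(cluster):
    """Label a region by the kind covering the most area (assumed rule).

    A thin underline inside a block of text therefore does not turn the
    region into "graphics".
    """
    area = {}
    for b in cluster:
        area[b.kind] = area.get(b.kind, 0.0) + (b.x1 - b.x0) * (b.y1 - b.y0)
    return max(area, key=area.get)


page = [
    Box("text", 50, 700, 300, 712),
    Box("text", 50, 685, 280, 697),
    Box("graphics", 50, 683, 120, 684),    # underline below the text
    Box("graphics", 100, 300, 400, 500),   # plot frame
    Box("graphics", 120, 320, 380, 480),   # plot contents
    Box("image", 420, 100, 520, 200),
]
labels = [label(c) for c in regions(page)]
```

With this made-up page the boxes fall into three regions: a text block that absorbs its underline, a plot, and a bitmap. A real page would need the threshold tuned to the material, and regions where text and graphics genuinely overlap would still need separate handling.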

