I have been doing a lot of graphical extraction of scientific "images", but in general there is no algorithmic way to do this. (I'd be happy to see if there is an overlap of our interests.)
To simplify: the PDF stream consists of bitmaps (images), glyphs (characters with code points) and paths (a mixture of Move, Line, Quadratic and Cubic curves, with Close (Z)). I tend to use "image" for bitmaps and "plots", "diagrams" or "graphics" for non-bitmap graphics.

A "plot" generally consists of characters and paths (and sometimes small images/bitmaps). But paths can occur anywhere, and a diagram is only defined by convention: either a whitespace border or a rectangular path surround. Characters can also be created by paths (cursive glyphs), which are difficult to interpret, and small paths can be embedded within runs of glyphs.

I convert these to SVG. In practice I attempt to identify diagrams by whitespace surrounds, borders, and formal identification such as "Figure 2." But some diagrams don't have captions (e.g. chemical reaction schemes). In other places paths are used as page decoration (e.g. thin lines, publisher icons, etc.).

So the simple answer is that there is no formal way, but there are heuristics. I am making useful progress with this and can extract certain types of diagrams into SVG. See https://github.com/petermr/normami (warning: it's complex and mostly created as a library).

On Tue, Mar 5, 2019 at 10:34 PM European Neuroscience Center <mnachev.nscenter...@gmail.com> wrote:

> Hi,
>
> What is the way to extract an embedded image, which is in SVG format from
> an PDF file using PDFBox?
>
> If there is no such option, how to determine from where the embedded SVG
> image starts and extract this XML part of the PDF file?
>
>
> Regards,
> Miro.
>

--
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
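
P.S. A minimal sketch of the whitespace-surround heuristic mentioned above, for illustration only. It is not taken from normami or PDFBox. It assumes that the bounding boxes of the paths and glyphs on a page have already been extracted by some other means, and the names `gap`, `union`, `group_by_whitespace` and the `margin` parameter are all hypothetical. Boxes separated by less than `margin` are merged, so each surviving box is a candidate diagram region surrounded by whitespace.

```python
from typing import List, Tuple

# A bounding box in page coordinates: (x0, y0, x1, y1), with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def gap(a: Box, b: Box) -> float:
    """Larger of the horizontal and vertical separations; 0 if the boxes touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return max(dx, dy)


def union(a: Box, b: Box) -> Box:
    """Smallest box that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_by_whitespace(boxes: List[Box], margin: float) -> List[Box]:
    """Repeatedly merge boxes closer than `margin` until none remain to merge.

    Each returned box is a candidate diagram: a cluster of primitives
    separated from every other cluster by at least `margin` of whitespace.
    """
    groups = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if gap(groups[i], groups[j]) < margin:
                    groups[i] = union(groups[i], groups[j])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups


# Two nearby primitives and one distant one (hypothetical coordinates).
page_boxes = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
candidates = group_by_whitespace(page_boxes, margin=5)
```

This is quadratic in the number of primitives and ignores captions, rectangular borders and page decoration, so at best it gives a first pass that the other heuristics would then have to refine.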