I have been doing a lot of graphical extraction of scientific "images",
but in general there is no algorithmic way to do it. (I'd be happy to see if
there is an overlap of our interests.)

To simplify: the PDF stream consists of bitmaps (images), glyphs
(characters with code points) and paths (a mixture of Move, Line, Quadratic
and Cubic curve segments, with Close (Z)). I tend to use "image" for bitmaps
and "plots", "diagrams" or "graphics" for non-bitmap graphics. A "plot"
generally consists of characters and paths (and sometimes small
images/bitmaps). But paths can occur anywhere, and a diagram is only defined
by convention: either a whitespace border or a surrounding rectangular path.
Characters can also be drawn as paths (cursive glyphs), which are difficult
to interpret, and small paths can be embedded within runs of glyphs. I
convert these to SVG.
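
As a rough illustration of the path-to-SVG step, here is a minimal sketch in
plain Java. It is not the code in my library: SvgPathBuilder and all its
method names are hypothetical, and it assumes the Move/Line/Quadratic/Cubic/
Close segments have already been pulled out of the PDF stream by some other
means. It also does not flip the y axis (PDF user space normally has y
pointing up, SVG has it pointing down), so a real converter would need a
transform as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: collects path segments and serialises them as SVG path data. */
final class SvgPathBuilder {
    private final List<String> segments = new ArrayList<>();

    // Locale.ROOT keeps '.' as the decimal separator regardless of the system locale.
    private static String num(double v) {
        return String.format(Locale.ROOT, "%.2f", v);
    }

    SvgPathBuilder moveTo(double x, double y) {
        segments.add("M " + num(x) + " " + num(y));
        return this;
    }

    SvgPathBuilder lineTo(double x, double y) {
        segments.add("L " + num(x) + " " + num(y));
        return this;
    }

    // Quadratic curve: one control point (x1, y1), end point (x, y).
    SvgPathBuilder quadTo(double x1, double y1, double x, double y) {
        segments.add("Q " + num(x1) + " " + num(y1) + " " + num(x) + " " + num(y));
        return this;
    }

    // Cubic curve: two control points, end point (x, y).
    SvgPathBuilder cubicTo(double x1, double y1, double x2, double y2, double x, double y) {
        segments.add("C " + num(x1) + " " + num(y1) + " " + num(x2) + " " + num(y2)
                + " " + num(x) + " " + num(y));
        return this;
    }

    SvgPathBuilder close() {
        segments.add("Z");
        return this;
    }

    /** Returns the value for the "d" attribute of an SVG path element. */
    String toPathData() {
        return String.join(" ", segments);
    }

    /** Wraps the path data in a path element; the stroke/fill styling is an arbitrary choice. */
    String toSvgElement() {
        return "<path d=\"" + toPathData() + "\" fill=\"none\" stroke=\"black\"/>";
    }
}
```

Calling moveTo, lineTo and close in sequence and then toSvgElement() gives a
single path element that can be dropped into an SVG document.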

In practice I attempt to identify diagrams by whitespace surrounds,
borders, and formal identification such as "Figure 2." But some diagrams
don't have captions (e.g. chemical reaction schemes). In other places paths
are used as page decoration (e.g. thin lines, publisher icons, etc.).
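
The "formal identification" heuristic can be sketched like this, assuming the
page text has already been extracted into lines. CaptionFinder and its
pattern are hypothetical, for illustration only; the pattern recognises just
"Figure N" and "Fig. N" at the start of a line, and real documents will need
more variants. As noted above, it will miss diagrams that have no caption at
all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds lines that look like figure captions. */
final class CaptionFinder {
    // Matches a line that starts with "Figure 2.", "Figure 2", "Fig. 3:" and similar.
    // Anchored at the line start so in-text references ("as shown in Figure 3") are skipped.
    private static final Pattern CAPTION =
            Pattern.compile("^\\s*(Figure|Fig\\.?)\\s+(\\d+)\\s*[.:]?");

    /** Returns a label such as "Figure 2" for every line that starts like a caption. */
    static List<String> findCaptionLabels(List<String> lines) {
        List<String> labels = new ArrayList<>();
        for (String line : lines) {
            Matcher m = CAPTION.matcher(line);
            if (m.find()) {
                labels.add(m.group(1) + " " + m.group(2));
            }
        }
        return labels;
    }
}
```

A caption found this way only says that a diagram is probably nearby; it
still has to be combined with the whitespace and border checks to decide
which paths and glyphs belong to it.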

So the simple answer is that there is no formal way, but there are
heuristics. I am making useful progress with this and can extract certain
types of diagrams into SVG.

See https://github.com/petermr/normami (warning: it's complex and mostly
created as a library).


On Tue, Mar 5, 2019 at 10:34 PM European Neuroscience Center <
mnachev.nscenter...@gmail.com> wrote:

> Hi,
>
> What is the way to extract an embedded image, which is in SVG format from
> an PDF file using PDFBox?
>
> If there is no such option, how to determine from where the embedded SVG
> image starts and extract this XML part of the PDF file?
>
>
> Regards,
> Miro.
>


-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
