Re: Text Extraction with Formatting

Peter Murray-Rust Tue, 07 May 2013 12:45:38 -0700

On Tue, May 7, 2013 at 8:39 AM, Maruan Sahyoun <[email protected]> wrote:


> Hi,
>
> you can either extract to HTML (call ExtractText with the -html option
> for example) or create your own logic. You can take a look at
> org.apache.pdfbox.util.PDFText2HTML as a starting point.
>
> There is also a project to convert PDFtoSVG using PDFBox as a basis which
> might also serve as an example (https://bitbucket.org/petermr/pdftosvg)
>
>
Thanks very much Maruan,
I was about to announce this in a few days' time. Essentially there are two
parts:

1. https://bitbucket.org/petermr/pdf2svg <https://bitbucket.org/petermr/pdftosvg> (sic)
and the developer version
https://bitbucket.org/petermr/pdf2svg-dev

These convert the PDFBox output to SVG (Unicode+UTF-8), with particular
emphasis on heuristics for uncommon and undocumented fonts. The main
architecture is stable. The *-dev version is being developed and will
concentrate mainly on codepointSets for fonts. This is an area where
contributions are straightforward, welcomed and incremental. Maths fonts
(esp. CMSY and other CM, Mathematical Pi, etc.) are a finite (if tedious)
task. Later we would like to add machine learning for glyphs where the font
is unknown.
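
To give a feel for what a codepointSet contribution amounts to, here is a
minimal Java sketch of a lookup from font-specific character codes to
Unicode code points. The class and method names are hypothetical (this is
not the pdf2svg API or its file format), and the three CMSY entries are
illustrative values that should be checked against the font's real encoding.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not the actual pdf2svg codepointSet format or API.
final class CodePointSetSketch {
    // font family -> (font-specific character code -> Unicode code point)
    private final Map<String, Map<Integer, Integer>> sets = new HashMap<>();

    void add(String fontFamily, int charCode, int unicode) {
        sets.computeIfAbsent(fontFamily, k -> new HashMap<>()).put(charCode, unicode);
    }

    // Returns the mapped character, or U+FFFD when the font or code is unknown.
    String toUnicode(String fontFamily, int charCode) {
        Map<Integer, Integer> set = sets.get(fontFamily);
        Integer codePoint = (set == null) ? null : set.get(charCode);
        return new String(Character.toChars(codePoint == null ? 0xFFFD : codePoint));
    }

    // Illustrative entries only; verify against the font before relying on them.
    static CodePointSetSketch withSampleCmsy() {
        CodePointSetSketch s = new CodePointSetSketch();
        s.add("CMSY", 0x00, 0x2212); // assumed: minus sign
        s.add("CMSY", 0x02, 0x00D7); // assumed: multiplication sign
        s.add("CMSY", 0x06, 0x00B1); // assumed: plus-minus sign
        return s;
    }
}
```

Returning U+FFFD for unmapped codes keeps the gaps visible in the output, so
a missing entry shows up as something to contribute rather than a silent
error.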

The SVG has three components:
* individual characters (svg:text) with x, y, char as well as font, colour, etc.
* paths (svg:path) for lines, circles, boxes, etc.
* svg:image for bitmaps.
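
The three components listed above can be sketched as follows. This is only
an illustration of the kind of SVG involved, assuming plain string output;
the class, method names and attribute choices are hypothetical and are not
taken from pdf2svg.

```java
import java.util.Locale;

// Hypothetical sketch of the three kinds of SVG element described above.
final class SvgComponentsSketch {
    // One character per svg:text element, with position, font and colour.
    static String text(double x, double y, String ch, String font, double size, String fill) {
        return String.format(Locale.ROOT,
            "<text x=\"%.2f\" y=\"%.2f\" font-family=\"%s\" font-size=\"%.1f\" fill=\"%s\">%s</text>",
            x, y, escape(font), size, escape(fill), escape(ch));
    }

    // Lines, circles, boxes and so on become svg:path elements.
    static String path(String d, String stroke) {
        return String.format(Locale.ROOT,
            "<path d=\"%s\" stroke=\"%s\" fill=\"none\"/>", escape(d), escape(stroke));
    }

    // Bitmaps become svg:image elements that reference the image data.
    static String image(double x, double y, double width, double height, String href) {
        return String.format(Locale.ROOT,
            "<image x=\"%.2f\" y=\"%.2f\" width=\"%.2f\" height=\"%.2f\" xlink:href=\"%s\"/>",
            x, y, width, height, escape(href));
    }

    // Wraps the elements of one page in an svg root element.
    static String page(String... elements) {
        StringBuilder sb = new StringBuilder(
            "<svg xmlns=\"http://www.w3.org/2000/svg\""
            + " xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n");
        for (String element : elements) {
            sb.append("  ").append(element).append('\n');
        }
        return sb.append("</svg>\n").toString();
    }

    // Minimal XML escaping for text content and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

For example, `page(text(72, 100, "A", "Helvetica", 10, "black"), path("M 72 110 L 300 110", "black"))`
yields a small page with one character and one rule beneath it.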

2. Converting the SVG to HTML and XML:
https://bitbucket.org/petermr/svg2xml-dev (only the developer version so
far). This is making good progress. It is highly heuristic with edge cases and
has been developed on technical documents, especially academic articles and
theses in STEM subjects. It can make reasonable sense of an average paper
including analysing pagination and numbering. It separates tables and
figures, and concatenates the main text.

Prototyped:
* extraction of rectangular tables
* reconstruction of vector graphic objects in figures (where there is
embedded PS)
* sub- and superscripts and very simple "formulae"
* sectioning
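
As an indication of what "highly heuristic" means in practice, here is a
minimal Java sketch of one way sub- and superscripts could be detected from
the per-character x, y and font size in the SVG. It is not the svg2xml
implementation; the class name and both thresholds are assumptions chosen
for illustration.

```java
// Hypothetical sketch: not the svg2xml algorithm, thresholds are assumed.
final class ScriptHeuristicSketch {
    enum Kind { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    // Assumed: a script character is at most 90% of the line's font size.
    static final double SIZE_RATIO = 0.9;
    // Assumed: it is shifted by at least 15% of the line's font size.
    static final double SHIFT_RATIO = 0.15;

    static Kind classify(double lineY, double lineFontSize,
                         double charY, double charFontSize) {
        boolean smaller = charFontSize < SIZE_RATIO * lineFontSize;
        double shift = charY - lineY; // SVG y grows downwards
        double threshold = SHIFT_RATIO * lineFontSize;
        if (smaller && shift < -threshold) {
            return Kind.SUPERSCRIPT; // smaller and raised above the baseline
        }
        if (smaller && shift > threshold) {
            return Kind.SUBSCRIPT; // smaller and lowered below the baseline
        }
        return Kind.NORMAL;
    }
}
```

A rule like this will misclassify some characters (small caps or footnote
markers, for instance), which is the kind of edge case mentioned here.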

Most of the remaining work will be edge cases (e.g. bad tables).

Not yet attempted:
* footnotes
* inline lists, including bibliography
* deconstruction of bitmapped graphics (text and paths).
* title pages (we are optimistic that analysis of several articles will
work)

It's at developer level and we'd love to have your input; it's a communal
Open Source project. It is not ready for general use, especially when people
don't understand that there is an inevitable (although small) error rate.

We'd be delighted to hear from anyone needing this, but at present you need
to be able to understand running Java and accept the rough edges.

Best

P.


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
