Re: Eliminating super scripts while extracting text from pdf

Peter Murray-Rust Sat, 29 Mar 2014 01:20:07 -0700

As Olaf says, there is no formal support for sub/superscripts in PDF.
Generally a smaller font size is used and the characters are raised or lowered.
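
To make the size-and-shift idea concrete, here is a minimal sketch in
Python. It is not the code from the repositories mentioned in this mail,
only an illustration. The Glyph record, its field names and the two
thresholds (size_ratio, min_shift) are assumptions; with PDFBox the
per-character font size and position would have to be collected from the
extraction output first. It also assumes y grows upward and that most
glyphs on a line are ordinary body text.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Glyph:
    """One extracted character; all field names are hypothetical."""
    text: str
    font_size: float   # in points
    baseline_y: float  # baseline height, assuming y grows upward


def classify_line(glyphs: List[Glyph], size_ratio: float = 0.85,
                  min_shift: float = 0.15) -> List[str]:
    """Label each glyph of one text line as 'normal', 'super' or 'sub'."""
    if not glyphs:
        return []
    # The median stands in for the body text, assuming it is the majority.
    body_size = median(g.font_size for g in glyphs)
    body_base = median(g.baseline_y for g in glyphs)
    labels = []
    for g in glyphs:
        # Baseline shift as a fraction of the body font size.
        shift = (g.baseline_y - body_base) / body_size
        small = g.font_size <= size_ratio * body_size
        if small and shift >= min_shift:
            labels.append("super")
        elif small and shift <= -min_shift:
            labels.append("sub")
        else:
            labels.append("normal")
    return labels


def strip_superscripts(glyphs: List[Glyph]) -> str:
    """Rebuild the line text without glyphs labelled 'super'."""
    labels = classify_line(glyphs)
    return "".join(g.text for g, lab in zip(glyphs, labels) if lab != "super")
```

For the case in the original question, a line holding "test." at 10pt
followed by a raised 6pt marker would come back from strip_superscripts
as "test." only. The thresholds will probably need tuning per document.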


We have written heuristics to detect sub/superscripts in the output of
PDFBox. See http://bitbucket.org/petermr/ami and
http://bitbucket.org/petermr/svg2xml-dev. This works well for scholarly
publishing. It is fairly general but may have to be tweaked for some other
applications. I have not commonly found Unicode sub/superscript characters
being used; the normal practice is a different font size and a vertical shift.
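
For the rarer case where a font does use dedicated Unicode characters,
a lookup by code point is enough. The sketch below is one possible way
to do it in Python, relying on the compatibility decomposition tags
(<super>, <sub>) in the Unicode character database; the function names
are made up for this example.

```python
import unicodedata
from typing import Optional


def script_kind(ch: str) -> Optional[str]:
    """Return 'super' or 'sub' for a dedicated Unicode character, else None."""
    # e.g. U+00B2 decomposes to "<super> 0032", U+2082 to "<sub> 0032"
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<super>"):
        return "super"
    if decomp.startswith("<sub>"):
        return "sub"
    return None


def strip_unicode_superscripts(text: str) -> str:
    """Drop every character whose decomposition is tagged <super>."""
    return "".join(ch for ch in text if script_kind(ch) != "super")
```

Note that a few other characters, such as the ordinal indicators and
the trade mark sign, also carry the <super> tag and would be removed
too, so a whitelist may be safer for some texts.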


On Fri, Mar 28, 2014 at 9:47 PM, Olaf Drümmer
<[email protected]> wrote:

> Two thoughts:
>
> - keep track of the baseline and size of characters; if the baseline is
> slightly shifted (upwards -> superscript, downwards -> subscript) and the
> size is smaller than that of the surrounding characters, it's possibly a
> superscript or subscript character
>
> - be aware of the fact that some fonts contain glyphs for superscripts -
> then baseline and text size would be the same; in such cases you'd have to
> look up via the Unicode code point whether you have encountered a
> superscript.
>
> Olaf
>
> On 28 Mar 2014 at 19:23, Siva Kumar Ch <[email protected]> wrote:
>
> > Hi,
> >
> > I am trying to extract text from pdf, and process the text. I have been
> > successful in extraction, but could not get much benefit out of it as the
> > extracted text treated the superscripts, usually numbers, as normal text.
> >
> > A superscript on a word that is the last word of a sentence is
> > placed after the period (.)
> >
> > ex: Word: "test" with superscript "super"
> > When it appeared at the end of a sentence, it was extracted as -
> > "test.super"
> >
> > Is there any way I can get rid of superscripts?
> >
> > --
> > Br,
> > Siva.
>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
