Hi Peter,

Thanks.

As suggested, I went through the links you provided, but unfortunately
could not find the heuristics for detecting sub/superscripts.

If possible, please attach them or provide a publicly accessible link.

Appreciate your help.


On Sat, Mar 29, 2014 at 4:19 AM, Peter Murray-Rust <[email protected]> wrote:

> As Olaf says, there is no formal support for sub/superscripts in PDF.
> Generally a smaller font size is used and the characters are raised or
> lowered.
>
> We have written heuristics to detect sub/superscripts in the output of
> PDFBox. See http://bitbucket.org/petermr/ami and
> http://bitbucket.org/petermr/svg2xml-dev. This works well for scholarly
> publishing. It is fairly general but may have to be tweaked for some other
> applications. I have not commonly found Unicode sub/superscripts being
> used; the normal approach is a different font size plus a vertical shift.
>
>
> On Fri, Mar 28, 2014 at 9:47 PM, Olaf Drümmer <[email protected]> wrote:
>
> > Two thoughts:
> >
> > - Keep track of the baseline and size of characters. If the baseline is
> > slightly shifted (upwards -> superscript, downwards -> subscript) and the
> > size is smaller than that of the surrounding characters, it is possibly
> > a superscript or subscript character.
> >
> > - Be aware that some fonts contain glyphs for superscripts. In that case
> > the baseline and text size would be the same as for normal text, so you
> > would have to look up via the Unicode code point whether you have
> > encountered a superscript.
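
The two checks above could be sketched roughly as follows. This is only a
minimal illustration, not the heuristics from the ami or svg2xml-dev
projects. The Glyph class, the Script enum, the ScriptHeuristics class and
both thresholds are hypothetical names and values made up for this sketch.
The glyph data is assumed to have been collected already (for example from
the character positions that PDFBox reports), with the y coordinate growing
upwards. If your coordinates grow downwards, flip the sign of the shift.

```java
import java.util.List;

/** Hypothetical glyph record for this sketch; not a PDFBox type. */
final class Glyph {
    final String text;      // Unicode text of the glyph
    final double baseline;  // baseline y, assumed to grow upwards
    final double fontSize;  // font size in points

    Glyph(String text, double baseline, double fontSize) {
        this.text = text;
        this.baseline = baseline;
        this.fontSize = fontSize;
    }
}

enum Script { NORMAL, SUPERSCRIPT, SUBSCRIPT }

final class ScriptHeuristics {
    // Assumed thresholds; they will need tuning for real documents.
    static final double SIZE_RATIO = 0.9;  // "smaller" = below 90% of reference size
    static final double SHIFT_RATIO = 0.1; // minimum shift = 10% of reference size

    /** Dedicated superscript code points: U+00B2, U+00B3, U+00B9, U+2070..U+207F. */
    static boolean isUnicodeSuperscript(int cp) {
        return cp == 0x00B2 || cp == 0x00B3 || cp == 0x00B9
                || (cp >= 0x2070 && cp <= 0x207F);
    }

    /** Dedicated subscript code points: U+2080..U+209C. */
    static boolean isUnicodeSubscript(int cp) {
        return cp >= 0x2080 && cp <= 0x209C;
    }

    /** Classifies one glyph against the baseline and size of the surrounding text. */
    static Script classify(Glyph g, double refBaseline, double refFontSize) {
        // Check 2: the font supplies a dedicated glyph, so size and baseline
        // look normal and only the code point gives it away.
        if (!g.text.isEmpty()) {
            int cp = g.text.codePointAt(0);
            if (isUnicodeSuperscript(cp)) return Script.SUPERSCRIPT;
            if (isUnicodeSubscript(cp)) return Script.SUBSCRIPT;
        }
        // Check 1: smaller than the surrounding text and shifted off its baseline.
        if (g.fontSize < refFontSize * SIZE_RATIO) {
            double shift = g.baseline - refBaseline;
            if (shift > SHIFT_RATIO * refFontSize) return Script.SUPERSCRIPT;
            if (shift < -SHIFT_RATIO * refFontSize) return Script.SUBSCRIPT;
        }
        return Script.NORMAL;
    }

    /** Rebuilds a line of text, leaving out every glyph classified as superscript. */
    static String stripSuperscripts(List<Glyph> line, double refBaseline, double refFontSize) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            if (classify(g, refBaseline, refFontSize) != Script.SUPERSCRIPT) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }
}
```

The reference baseline and font size are passed in by the caller here. In
practice they would have to be estimated per line, for instance from the
most common font size on that line, which is again an assumption of this
sketch.
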
> >
> > Olaf
> >
> > Am 28 Mar 2014 um 19:23 schrieb Siva Kumar Ch <[email protected]>:
> >
> > > Hi,
> > >
> > > I am trying to extract text from PDF and process the text. I have
> > > been successful in the extraction, but could not get much benefit out
> > > of it, as the extracted text treats the superscripts, usually numbers,
> > > as normal text.
> > >
> > > A superscript attached to the last word of a sentence is placed
> > > after the period (.).
> > >
> > > Example: the word "test" with superscript "super".
> > > When it appears at the end of a sentence, it is extracted as
> > > "test.super"
> > >
> > > Is there any way I can get rid of superscripts?
> > >
> > > --
> > > Br,
> > > Siva.
> >
> >
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
