Re: Eliminating super scripts while extracting text from pdf

Peter Murray-Rust Mon, 31 Mar 2014 15:17:29 -0700

The second link (on Bitbucket) has a download, so you can download and run
the code. It's all under Maven. The documentation isn't great. You
should find some example PDF documents with subscripts in there.
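
In case it helps while you look for the code, here is a minimal sketch of
the general idea discussed further down this thread (a smaller font size
combined with a shifted baseline). It is not the code from ami or
svg2xml-dev. The TextChunk class, the two threshold values and the
assumption that the baseline coordinate grows upwards are all made up for
illustration; you would fill TextChunk from whatever position and font size
information your extractor reports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ScriptHeuristic {
    enum Kind { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    /** Hypothetical container for one extracted run of text (not a PDFBox class). */
    static final class TextChunk {
        final String text;
        final double baselineY; // assumed to grow upwards, as in PDF user space
        final double fontSize;  // in points

        TextChunk(String text, double baselineY, double fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    // Assumed thresholds; expect to tune them for your documents.
    static final double SIZE_RATIO = 0.85;  // "smaller" means below 85% of the line's size
    static final double SHIFT_RATIO = 0.10; // "shifted" means more than 10% of the line's size

    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Classifies each chunk of one text line against the line's median size and baseline. */
    static List<Kind> classify(List<TextChunk> line) {
        List<Kind> kinds = new ArrayList<>();
        if (line.isEmpty()) {
            return kinds;
        }
        double refSize = median(line.stream().mapToDouble(c -> c.fontSize).toArray());
        double refBase = median(line.stream().mapToDouble(c -> c.baselineY).toArray());
        double minShift = SHIFT_RATIO * refSize;
        for (TextChunk c : line) {
            boolean smaller = c.fontSize < SIZE_RATIO * refSize;
            double shift = c.baselineY - refBase;
            if (smaller && shift > minShift) {
                kinds.add(Kind.SUPERSCRIPT);
            } else if (smaller && shift < -minShift) {
                kinds.add(Kind.SUBSCRIPT);
            } else {
                kinds.add(Kind.NORMAL);
            }
        }
        return kinds;
    }

    /** Joins the text of all chunks that were not classified as sub/superscripts. */
    static String stripScripts(List<TextChunk> line) {
        List<Kind> kinds = classify(line);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.size(); i++) {
            if (kinds.get(i) == Kind.NORMAL) {
                sb.append(line.get(i).text);
            }
        }
        return sb.toString();
    }
}
```

For the "test.super" case from the original question, a line made of
"test" and "." at size 10 on baseline 100 plus "super" at size 6 on
baseline 104 would have "super" classified as a superscript, and
stripScripts would keep only "test.". The median is taken per chunk, so a
line that consists mostly of small raised text will confuse it. If your
extractor reports y growing downwards, flip the sign of the shift.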



On Mon, Mar 31, 2014 at 10:17 PM, Siva Kumar <[email protected]> wrote:

> Hi Peter,
>
> Thanks.
>
> As suggested, I have gone through the links provided, but unfortunately
> could not get to the heuristics to detect the subsuperscripts.
>
> If possible, please attach or provide a link that can publicly be accessed.
>
> Appreciate your help.
>
>
> On Sat, Mar 29, 2014 at 4:19 AM, Peter Murray-Rust <[email protected]>
> wrote:
>
> > As Olaf says, there is no formal support for sub/superscripts in PDF.
> > Generally a smaller font size is used and the characters are
> > raised/lowered.
> >
> > We have written heuristics to detect sub/superscripts in the output of
> > PDFBox. See http://bitbucket.org/petermr/ami and
> > http://bitbucket.org/petermr/svg2xml-dev. This works well for scholarly
> > publishing - it's fairly general but may have to be tweaked for other
> > applications. I have not commonly found Unicode sub/superscripts being
> > used - it's normal to use other font sizes and a shift.
> >
> >
> > On Fri, Mar 28, 2014 at 9:47 PM, Olaf Drümmer
> > <[email protected]> wrote:
> >
> > > Two thoughts:
> > >
> > > - keep track of the baseline and size of characters; if the baseline is
> > > slightly shifted (upwards -> superscript, downwards -> subscript) and the
> > > size is smaller than that of the surrounding characters, it's possibly a
> > > superscript or subscript character
> > >
> > > - be aware of the fact that some fonts contain glyphs for superscripts -
> > > then baseline and text size would be the same; in such cases you'd have to
> > > look up via the Unicode code point whether you have encountered a
> > > superscript.
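
The code point lookup in the second point can be sketched as a plain range
check. This is only an illustration: the class and method names are made up,
and it covers just the Latin-1 superscript digits (U+00B9, U+00B2, U+00B3)
and the Unicode "Superscripts and Subscripts" block. Modifier letters and
other raised forms elsewhere in Unicode are deliberately left out.

```java
class UnicodeScripts {
    /** True for the superscript digits in Latin-1 and the superscript half of U+2070..U+209F. */
    static boolean isSuperscript(int cp) {
        return cp == 0x00B9 || cp == 0x00B2 || cp == 0x00B3
                || (cp >= 0x2070 && cp <= 0x207F);
    }

    /** True for the subscript half of the "Superscripts and Subscripts" block. */
    static boolean isSubscript(int cp) {
        return cp >= 0x2080 && cp <= 0x209C;
    }

    /** Returns the input with every code point matched by the two checks above removed. */
    static String strip(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
                .filter(cp -> !isSuperscript(cp) && !isSubscript(cp))
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

With this, strip("test.\u00B9\u00B2") gives "test." and strip("H\u2082O")
gives "HO". Whether dropping the characters is what you want, rather than
mapping them back to ordinary digits, depends on the application.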
> > >
> > > Olaf
> > >
> > > On 28 Mar 2014 at 19:23, Siva Kumar Ch <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am trying to extract text from pdf, and process the text. I have been
> > > > successful in extraction, but could not get much benefit out of it as the
> > > > extracted text treated the superscripts, usually numbers, as normal text.
> > > >
> > > > A superscript to a word, which is the last word of a sentence, has been
> > > > placed after the period (.)
> > > >
> > > > ex: Word: "test" with superscript "super"
> > > > When it appeared at the end of a sentence, has been extracted as -
> > > > "test.super"
> > > >
> > > > Is there any way I can get rid of superscripts?
> > > >
> > > > --
> > > > Br,
> > > > Siva.
> > >
> > >
> >
> >
> > --
> > Peter Murray-Rust
> > Reader in Molecular Informatics
> > Unilever Centre, Dep. Of Chemistry
> > University of Cambridge
> > CB2 1EW, UK
> > +44-1223-763069
> >
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
