Hi Peter, Thanks.
As suggested, I have gone through the links provided, but unfortunately I could not find the heuristics for detecting sub/superscripts. If possible, please attach them or provide a link that is publicly accessible. I appreciate your help.

On Sat, Mar 29, 2014 at 4:19 AM, Peter Murray-Rust <[email protected]> wrote:

> As Olaf says there is no formal support for sub/superscripts in PDF.
> Generally a lower font size is used and the characters are raised/lowered.
>
> We have written heuristics to detect sub/superscripts in the output of
> PDFBox. See http://bitbucket.org/petermr/ami and
> http://bitbucket.org/petermr/svg2xml-dev. This works well for scholarly
> publishing - it's fairly general but may have to be tweaked for some other
> applications. I have not commonly found Unicode sub/superscripts being used
> - it's normal to use other font sizes and a shift.
>
>
> On Fri, Mar 28, 2014 at 9:47 PM, Olaf Drümmer <[email protected]> wrote:
>
> > Two thoughts:
> >
> > - keep track of the baseline and size of characters; if the baseline is
> > slightly shifted (upwards -> superscript, downwards -> subscript) and the
> > size is smaller than that of the surrounding characters, it's possibly a
> > superscript or subscript character
> >
> > - be aware of the fact that some fonts contain glyphs for superscripts -
> > then baseline and text size would be the same; in such cases you'd have to
> > look up via the Unicode code point whether you have encountered a
> > superscript.
> >
> > Olaf
> >
> > On 28 Mar 2014 at 19:23, Siva Kumar Ch <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I am trying to extract text from a PDF and process the text. I have been
> > > successful in the extraction, but could not get much benefit out of it, as
> > > the extracted text treats the superscripts, usually numbers, as normal
> > > text.
> > >
> > > A superscript attached to the last word of a sentence is placed after
> > > the period (.)
> > >
> > > ex: Word: "test" with superscript "super"
> > > When it appears at the end of a sentence, it is extracted as -
> > > "test.super"
> > >
> > > Is there any way I can get rid of superscripts?
> > >
> > > --
> > > Br,
> > > Siva.
> > >
> >
> >
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
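
For reference, the two checks Olaf describes in the quoted thread (baseline shift plus smaller size, and a Unicode lookup for fonts with dedicated superscript glyphs) could be sketched roughly as below. This is only an illustration under stated assumptions, not the heuristics from the ami or svg2xml-dev repositories: the Glyph class, the function names and the two thresholds are hypothetical, the y coordinate is assumed to grow upwards, the largest font size on the line is assumed to be the body text, and the per-character text, baseline and font size are assumed to have been collected already from the extractor's output.

```python
from dataclasses import dataclass
import statistics
import unicodedata


@dataclass
class Glyph:
    """Hypothetical container for one extracted character."""
    text: str
    baseline_y: float  # assumption: y grows upwards, as in PDF user space
    font_size: float


def classify_line(glyphs, size_ratio=0.85, min_shift=0.1):
    """Label each glyph on one text line as 'normal', 'super' or 'sub'.

    size_ratio and min_shift are illustrative thresholds, not tuned values.
    """
    if not glyphs:
        return []
    # Assumption: the largest font size on the line is the body text size.
    ref_size = max(g.font_size for g in glyphs)
    body = [g.baseline_y for g in glyphs if g.font_size >= size_ratio * ref_size]
    ref_baseline = statistics.median(body)

    labels = []
    for g in glyphs:
        # Check 2: dedicated Unicode super/subscript characters keep the
        # normal baseline and size, so look at the code point instead.
        decomp = unicodedata.decomposition(g.text) if len(g.text) == 1 else ""
        if decomp.startswith("<super>"):
            labels.append("super")
            continue
        if decomp.startswith("<sub>"):
            labels.append("sub")
            continue
        # Check 1: smaller than body text and shifted off the baseline.
        shift = g.baseline_y - ref_baseline
        if g.font_size < size_ratio * ref_size and shift > min_shift * ref_size:
            labels.append("super")
        elif g.font_size < size_ratio * ref_size and shift < -min_shift * ref_size:
            labels.append("sub")
        else:
            labels.append("normal")
    return labels


def strip_scripts(glyphs):
    """Return the line's text with super/subscript glyphs dropped."""
    labels = classify_line(glyphs)
    return "".join(g.text for g, lab in zip(glyphs, labels) if lab == "normal")
```

With something like this, the "test.super" case from the original question would come out as "test." once the raised, smaller characters are dropped. The thresholds would most likely need adjusting per document.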

