As Olaf says, there is no formal support for sub/superscripts in PDF. Generally a smaller font size is used and the characters are raised or lowered.
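
To make the idea concrete, here is a minimal sketch of that kind of heuristic in Python. It is not the code in the repositories mentioned in this mail, and it does not call PDFBox: the `Glyph` record, the `classify_line` function and both thresholds are hypothetical, chosen only for illustration. It assumes you have already extracted, for each character on a line, its text, baseline y coordinate and font size (in PDFBox these would presumably come from each `TextPosition`), and that y grows upwards as in PDF user space. It also covers Olaf's second case by checking the Unicode decomposition of the character.

```python
from dataclasses import dataclass
from statistics import median
import unicodedata


@dataclass
class Glyph:
    """Hypothetical record for one extracted character."""
    text: str
    baseline: float   # y of the baseline; assumed to grow upwards
    font_size: float


def classify_line(glyphs, size_ratio=0.85, shift_ratio=0.1):
    """Label each glyph on one text line as 'normal', 'super' or 'sub'.

    size_ratio and shift_ratio are illustrative thresholds (assumptions),
    expressed relative to the dominant font size of the line.
    """
    if not glyphs:
        return []
    # Treat the median font size as the body size of the line.
    body_size = median(g.font_size for g in glyphs)
    # Body baseline: median baseline of the glyphs that are not small.
    body_baseline = median(
        g.baseline for g in glyphs if g.font_size >= size_ratio * body_size
    )
    labels = []
    for g in glyphs:
        # Case 2: the font supplies a dedicated Unicode sub/superscript glyph.
        decomp = unicodedata.decomposition(g.text[0]) if g.text else ""
        if decomp.startswith("<super>"):
            labels.append("super")
            continue
        if decomp.startswith("<sub>"):
            labels.append("sub")
            continue
        # Case 1: smaller than the body text and shifted off the baseline.
        shift = g.baseline - body_baseline
        if g.font_size < size_ratio * body_size:
            if shift > shift_ratio * body_size:
                labels.append("super")
                continue
            if shift < -shift_ratio * body_size:
                labels.append("sub")
                continue
        labels.append("normal")
    return labels


# Usage: "test." followed by a small raised "1", as in the original question.
line = [Glyph(c, 100.0, 10.0) for c in "test."] + [Glyph("1", 104.0, 6.5)]
labels = classify_line(line)
plain = "".join(g.text for g, lab in zip(line, labels) if lab == "normal")
print(plain)  # prints "test."
```

Dropping everything not labelled 'normal' removes the superscripts, which is what the original question asked for. Real documents will need the thresholds tuned, and lines that consist mostly of small text would need a better estimate of the body size than the median.
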
We have written heuristics to detect sub- and superscripts in the output of PDFBox. See http://bitbucket.org/petermr/ami and http://bitbucket.org/petermr/svg2xml-dev. This works well for scholarly publishing. It is fairly general but may have to be tweaked for other applications.

I have rarely found Unicode sub/superscripts being used; the normal practice is to use a different font size and a shift.

On Fri, Mar 28, 2014 at 9:47 PM, Olaf Drümmer <[email protected]> wrote:

> Two thoughts:
>
> - keep track of the baseline and size of characters, if the baseline is
> slightly shifted (upwards -> superscript, downward -> subscript) and the
> size is smaller than surrounding characters, it's possibly a superscript or
> subscript character
>
> - be aware of the fact that some fonts contain glyphs for superscripts -
> then baseline and text size would be the same; in such cases you'd have to
> look up via the Unicode code point whether you have encountered a
> superscript.
>
> Olaf
>
> On 28 Mar 2014 at 19:23, Siva Kumar Ch <[email protected]> wrote:
>
> > Hi,
> >
> > I am trying to extract text from pdf, and process the text. I have been
> > successful in extraction, but could not get much benefits out of it as the
> > extracted text treated the superscripts, usually numbers, as normal text.
> >
> > A superscript to a word, which is the last word of a sentence, has been
> > placed after the period(.)
> >
> > ex: Word: "test" with superscript "super"
> > When it appeared at the end of a sentence, has been extracted as -
> > "test.super"
> >
> > Is there any way I can get rid of superscripts?
> >
> > --
> > Br,
> > Siva.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

