Hello,

I am trying to extract information about mathematical formulas from PDF documents and have run into a problem. As an example, see the formula P(A) = ... in the middle of page 15 of http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf.

My idea was to get the TextPosition objects and use the encoding of the font to look up the glyph name of each character. This works only partially. For the character '=' I get the name 'equal' and for '(' the name 'parenleft', as expected. For the absolute value bars, however, I get the name 'j'. Why is this? Does the font not have a separate name for that glyph? The same happens with the omega character, which gets the name 'W', and with the infinity symbol at the end of the page, which gets the name 'yen'. When I extract the information with Adobe Reader, I get the same results.

The document was created using pdfTeX. Does this problem affect every mathematical PDF? Is there no way to find out which character is really displayed? And is it an error in the font that a glyph has a completely different name from the character it displays?
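To make clear what I mean by the lookup, here is a small stand-alone sketch in plain Python (it does not use PDFBox). The table is a hand-copied excerpt of Adobe StandardEncoding. The function name `glyph_name` and the character codes in `observed` are my own assumptions, chosen so that the table reproduces the names I see. I have not verified which codes the file actually contains.

```python
# Excerpt of a code -> glyph-name table, with entries taken from
# Adobe StandardEncoding. Using this table is an assumption about
# what the lookup ends up consulting; it is not read from the PDF.
STANDARD_ENCODING_EXCERPT = {
    0x28: "parenleft",
    0x3D: "equal",
    0x57: "W",
    0x6A: "j",
    0xA5: "yen",
}


def glyph_name(code, encoding=STANDARD_ENCODING_EXCERPT):
    """Return the glyph name the encoding assigns to a character code."""
    return encoding.get(code, ".notdef")


# (what the page displays, assumed character code) -- the codes are
# hypothetical values, not extracted from the document.
observed = [
    ("equals sign", 0x3D),
    ("left parenthesis", 0x28),
    ("absolute value bar", 0x6A),
    ("omega", 0x57),
    ("infinity", 0xA5),
]

for shown, code in observed:
    print(f"{shown}: code 0x{code:02X} -> {glyph_name(code)}")
```

If the codes really are these, the names I get would simply be what a standard Latin table assigns to those codes, independent of what the glyph looks like. That is only a guess on my part.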

Yours sincerely

Sebastian
