PDF text extraction question

Sebastian Freuck Wed, 23 Jun 2010 04:57:05 -0700

Hello. I am trying to extract information about mathematical formulas from PDF documents and have run into a problem. See the example http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf on page 15, the formula in the middle: P(A) = ...

I thought about getting the TextPosition objects and using the encoding of the font to get the glyph name of each character. This works only partially. For example, for the character '=' I get the name 'equal' and for '(' the name 'parenleft'. However, for the absolute value bars I get the name 'j'. Why is this? Does the font not have a separate name for it? The same happens with the omega character, which gets the name 'W'. Something similar happens with the infinity symbol at the end of the page, which shows up with the name 'yen'.

When I try to extract the information using Adobe Reader, I get the same results. The document was created using pdfTeX. Is this problem the same for every mathematical PDF? Is there no way to find out which character is really displayed? Also, is it an error in the font that the glyph has a completely different name than the character it displays?
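The glyph-name lookup described above can be illustrated with a small, self-contained sketch. It does not use PDFBox, and the class `GlyphNameDemo`, its methods, and the layout labels "OMS" and "Symbol" are hypothetical names chosen for illustration. The first table holds the names a Latin text encoding gives to the character codes in question. The second table is an assumption: it lists what a TeX math symbol layout and the Adobe Symbol layout draw at those same codes. Which fonts the PDF actually embeds has not been checked here.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphNameDemo {
    // Glyph names that a Latin text encoding assigns to these character codes.
    private static final Map<Integer, String> LATIN_NAMES = new HashMap<>();

    // ASSUMPTION: what a math font layout draws at the same codes.
    // "OMS" (TeX math symbols) and "Symbol" (Adobe Symbol) are illustrative
    // labels; the fonts in the actual PDF may differ.
    private static final Map<String, String> ASSUMED_DRAWN = new HashMap<>();

    static {
        LATIN_NAMES.put(0x28, "parenleft");
        LATIN_NAMES.put(0x3D, "equal");
        LATIN_NAMES.put(0x57, "W");
        LATIN_NAMES.put(0x6A, "j");
        LATIN_NAMES.put(0xA5, "yen");

        ASSUMED_DRAWN.put(key("OMS", 0x6A), "|");         // vertical bar
        ASSUMED_DRAWN.put(key("Symbol", 0x57), "\u03A9"); // capital omega
        ASSUMED_DRAWN.put(key("Symbol", 0xA5), "\u221E"); // infinity
    }

    private static String key(String layout, int code) {
        return layout + ":" + code;
    }

    /** Name reported when the code is read through a Latin text encoding. */
    static String reportedName(int code) {
        return LATIN_NAMES.get(code);
    }

    /** Character the assumed layout draws at this code, or null if unknown. */
    static String assumedDrawn(String layout, int code) {
        return ASSUMED_DRAWN.get(key(layout, code));
    }

    public static void main(String[] args) {
        int[] codes = {0x6A, 0x57, 0xA5};
        String[] layouts = {"OMS", "Symbol", "Symbol"};
        for (int i = 0; i < codes.length; i++) {
            System.out.printf("code 0x%02X: reported name '%s', assumed glyph '%s' (%s)%n",
                    codes[i], reportedName(codes[i]),
                    assumedDrawn(layouts[i], codes[i]), layouts[i]);
        }
    }
}
```

If these assumptions hold, the names 'j', 'W' and 'yen' would come from reading symbol-font character codes through a text encoding's name table, and not from the glyph outlines themselves.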


Yours sincerely


Sebastian
