Discrepancy between rendered and extracted characters.

Zeev Sands Sat, 19 Apr 2014 11:58:32 -0700

Hello,

I am new to PDFBox and the PDF format in general, so I apologize if my questions are uninformed.

I am trying to extract text from a PDF file, and some of the characters that render correctly on the screen (via Acrobat) are coming out funny. 99% of the characters from the PDF are extracted correctly, but in one place, for example, what appears as a letter X on the screen is extracted as '}{', and in another place two paragraph symbols (¶¶) are extracted as 'iii!'.

After poking around PDFStreamEngine and PDFStreamParser, I can see that the string rendering as 'X' on the screen is coming out of the PDF stream as <007D007B0020>, which is 00 + '}' + 00 + '{' + 00 + ' ', so that is what is extracted. Yet on the screen it is clearly an X with the backward slash thicker than the forward slash, set in a nice serif font. So, as far as I understand, inside the PDF it *is* '}{ ', but it renders as 'X' on the screen.
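
To make the arithmetic above concrete, here is a minimal sketch of how I am reading that hex string. It assumes each character code is two bytes and that the code is mapped straight to the Unicode code point with the same value, which is exactly the assumption that seems to go wrong here. The class and method names (HexStringCodes, toCodes, naiveText) are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not PDFBox API.
class HexStringCodes {

    // Split the body of a PDF hex string (without the angle brackets)
    // into 2-byte character codes, assuming 4 hex digits per code.
    static List<Integer> toCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a multiple of 4 hex digits");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // Naive extraction: treat every character code as a Unicode code point.
    // This ignores whatever encoding the font itself defines.
    static String naiveText(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int code : toCodes(hex)) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCodes("007D007B0020"));               // [125, 123, 32]
        System.out.println("[" + naiveText("007D007B0020") + "]"); // [}{ ]
    }
}
```

The codes 0x7D, 0x7B and 0x20 come out as '}', '{' and a space, which matches the extracted text but not the glyph drawn on the screen.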

Is there any way I can get that X, or more importantly those ¶¶? Where in the PDFBox code can I look to figure it out? Perhaps I am missing a basic understanding of how character rendering works in PDF. Could someone please point me in the right direction with references, links, etc.?


Thank you,
-ZS
