Hello,

I am new to PDFBox and to the PDF format in general, so I apologize if my questions are uninformed.

I am trying to extract text from a PDF file, and some of the characters that render correctly on the screen (via Acrobat) are coming out wrong. 99% of the characters are extracted correctly, but in one place, for example, what appears as a letter X on the screen is extracted as '}{', and in another place two paragraph symbols (¶¶) are extracted as 'iii!'.

After poking around PDFStreamEngine and PDFStreamParser, I can see that the string rendering as 'X' on the screen comes out of the PDF stream as <007D007B0020>, which is 00 + '}' + 00 + '{' + 00 + ' ', so that is what gets extracted. Yet on the screen it is clearly an X, with the backward slash thicker than the forward slash, set in a nice serif font. So, as far as I understand, inside the PDF it *is* '}{ ', but it renders as 'X' on the screen.
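
To make the by-hand decoding of that hex string concrete, here is a minimal sketch in Python (not PDFBox code). The function name decode_hex_string is made up for illustration. It assumes that every character code is exactly two bytes and that each code can be read directly as a Unicode code point, which appears to match what the extraction produces; neither assumption is confirmed for this font.

```python
def decode_hex_string(hex_string):
    """Split a PDF hex string such as <007D007B0020> into 2-byte codes.

    Assumption: each character code is two bytes, big-endian, and is
    read here as a Unicode code point. The font in the PDF may map
    these codes to entirely different glyphs.
    """
    digits = hex_string.strip().strip("<>")
    raw = bytes.fromhex(digits)
    codes = []
    for i in range(0, len(raw), 2):
        code = int.from_bytes(raw[i:i + 2], "big")
        codes.append((code, chr(code)))
    return codes


codes = decode_hex_string("<007D007B0020>")
print(codes)  # [(125, '}'), (123, '{'), (32, ' ')]
```

This only shows which codes are stored in the stream. It says nothing about which glyphs the font draws for them, and that is the part I do not understand.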

Is there any way I can get that X, or, more importantly, those ¶¶? Where in the PDFBox code can I look to figure this out? Perhaps I am missing a basic understanding of how character rendering works in PDF. Could someone please point me in the right direction, with references, links, etc.?

Thank you,
-ZS

