Hello,
I am new to pdfbox and the pdf format in general, so I apologize if my
questions are uninformed.
I am trying to extract text from a pdf file, and some of the characters
that render correctly on the screen (in acrobat) are extracted
incorrectly. 99% of the characters from the pdf are extracted correctly,
but in one place, for example, what appears as a letter X on the screen
is extracted as '}{', and in another place two paragraph symbols (¶¶)
are extracted as 'iii!'.
After poking around PDFStreamEngine and PDFStreamParser, I can see that
the string rendering as 'X' on the screen is coming out of the pdf
stream as <007D007B0020>, which is 00 + '}' + 00 + '{' + 00 + ' ', so
that is what is extracted. Yet on the screen it is clearly an X with the
backward slash thicker than the forward slash, set in a nice serif font.
So as far as I understand, inside the pdf it *is* '}{ ', but it renders
as 'X' on the screen.
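To make my reading of those bytes concrete, here is a minimal sketch in
plain Java (no pdfbox calls). The class and method names are my own,
made up for illustration, and it assumes every character code in the
string is two bytes wide, which is only my guess from the 00 padding.
It also treats each code as a Unicode value, which may well be the wrong
assumption for this font.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of pdfbox: shows how <007D007B0020>
// turns into '}{ ' when each 2-byte code is read as a Unicode value.
public class HexCodes {

    // Split a PDF hex string such as "<007D007B0020>" into 2-byte codes.
    // Assumption: every code is exactly two bytes (four hex digits).
    public static List<Integer> twoByteCodes(String hexString) {
        String hex = hexString.replaceAll("[<>\\s]", "");
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("not a whole number of 2-byte codes");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // Naive reading: treat each code as a Unicode code point.
    public static String naiveText(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = twoByteCodes("<007D007B0020>");
        System.out.println(codes);                        // [125, 123, 32]
        System.out.println("[" + naiveText(codes) + "]"); // [}{ ]
    }
}
```

So the codes themselves say '}{ ', and whatever turns them into the X I
see on the screen must happen somewhere after this step.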
Is there any way I can get that X? Or, more importantly, those ¶¶? Where
in the pdfbox code can I look to figure it out? Perhaps I am missing a
basic understanding of how character rendering works in pdf. Could
someone please point me in the right direction? References, links, etc.
would be welcome.
Thank you,
-ZS
- Discrepancy between rendered and extracted characters. Zeev Sands