> Hello there,
>
> I'm using the text extraction of the Apache PDFBox 0.8.0 library.
> Unfortunately, the text extraction is replacing some signs and letters by
> '?'.

Without having seen the PDF file, my guess is that the affected characters are set in a font that PDFBox 0.8.0 does not properly support. The rules for translating bytes into character codes can be embedded in the font program itself, and PDFBox does not yet know how to parse or interpret every type of font program, so it falls back to "?" for those characters. Hopefully the upcoming PDFBox 1.0.0 release will be a bit more capable in this regard.

VR
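
To illustrate the kind of fallback I am guessing at, here is a rough sketch in plain Java. It is not PDFBox code: the class `FallbackDecoder` and its methods are hypothetical names made up for this example, and the table is a simplified stand-in for whatever code-to-character mapping a font program carries. It only shows the idea that a code without a known mapping ends up as "?" in the extracted text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT a PDFBox class.
class FallbackDecoder {
    // Simplified stand-in for the code-to-Unicode table a font may provide.
    private final Map<Integer, String> codeToUnicode = new HashMap<>();

    // Register a known mapping from a one-byte character code to text.
    void addMapping(int code, String unicode) {
        codeToUnicode.put(code, unicode);
    }

    // Translate each byte; codes without a mapping become "?".
    String decode(byte[] codes) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned code
            String unicode = codeToUnicode.get(code);
            out.append(unicode != null ? unicode : "?");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        FallbackDecoder font = new FallbackDecoder();
        font.addMapping(0x01, "H");
        font.addMapping(0x02, "i");
        // 0x03 is deliberately left unmapped, as if the font's table
        // could not be read for that code.
        System.out.println(font.decode(new byte[] {0x01, 0x02, 0x03})); // prints "Hi?"
    }
}
```

If this guess is right, the "?" characters in your output should appear only for text drawn in particular fonts, while text in other fonts extracts correctly.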

