Hi, I'm trying to extract text and text position data using my own subclass of PDFTextStripper. However, some of the TextPosition objects generated by PDFStreamEngine in its processEncodedText() method contain garbage text.
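For anyone who wants to check their own output, here is a minimal sketch of how such strings could be flagged. It assumes Java 8 or later and treats "garbage" as any string that is empty or contains ISO control characters (which includes tab and newline). The class GarbageTextCheck and its methods are hypothetical helpers for illustration, not part of PDFBox or FontBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper; not part of PDFBox or FontBox.
class GarbageTextCheck {

    // True if the text contains at least one ISO control character
    // (U+0000..U+001F or U+007F..U+009F).
    static boolean hasControlChars(String text) {
        return text.codePoints().anyMatch(Character::isISOControl);
    }

    // Returns, in ascending order, the codes whose mapped string is
    // null, empty, or contains a control character.
    static List<Integer> suspectCodes(Map<Integer, String> codeToCharacter) {
        List<Integer> suspects = new ArrayList<>();
        for (Map.Entry<Integer, String> entry
                : new TreeMap<>(codeToCharacter).entrySet()) {
            String value = entry.getValue();
            if (value == null || value.isEmpty() || hasControlChars(value)) {
                suspects.add(entry.getKey());
            }
        }
        return suspects;
    }
}
```

The same hasControlChars check could be applied to the string held by each TextPosition inside a PDFTextStripper subclass; that wiring is left out here so the sketch does not depend on any particular PDFBox version.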
The garbage text often contains a few printable characters interspersed with non-printable characters. I've traced the issue back to PDType1CFont, where the getCharacter() method converts raw bytes into characters using a "codeToCharacter" map. This map is built from the data returned by CFFFont#getMappings(), and is effectively a composite of the code-to-name and name-to-character mappings. In this case the codeToCharacter map looks very suspect: many codes are mapped to non-printable characters, and only a few are mapped to printable ones:

{1=, 2=, 3=, 4=, 5=, 6=, 7=, 8=, 9= , 10= , 78=N, 11=, 12=, 13= , 14=, 15=, 17=, 16=, 19=, 18=, 20=, 102=f, 103=g, 100=d, 101=e, 98=b, 99=c, 97=a, 110=n, 111=o, 108=l, 109=m, 105=i, 117=u, 116=t, 115=s, 114=r, 112=p, 122=z, 121=y}

The font is used to render the text of a multi-page article, which certainly contains more of the alphabet than the letters represented above (definitely has a 'b' in it).

I'm using version 1.1.0 of pdfbox and fontbox. I suppose this could be a problem in CFFParser or CFFFont, or in the code-to-name mapping overrides applied by PDType1CFont#loadOverride().

Has anyone come across a similar problem?

Regards,
Karl