Hi,

I'm trying to extract text and text position data using my own
subclass of PDFTextStripper. However, some of the TextPosition objects
generated by PDFStreamEngine in its method processEncodedText()
contain garbage text.

The garbage text often contains a few printable characters
interspersed with non-printable characters.

I've traced the issue back to PDType1CFont, where the method
getCharacter() decodes raw bytes into characters using a
"codeToCharacter" map. This map is constructed from data returned by
CFFFont#getMappings(), and is a sort of composite map of code-to-name
and name-to-character. In this case the codeToCharacter map looks very
suspect, and it is indeed there that many codes are mapped to
non-printable characters, and only a few are mapped to printable
characters:

{1=, 2=, 3=, 4=, 5=, 6=, 7=, 8=, 9=     , 10=
, 78=N, 11=, 12=, 13=
, 14=, 15=, 17=, 16=, 19=, 18=, 20=, 102=f, 103=g, 100=d,
101=e, 98=b, 99=c, 97=a,
 110=n, 111=o, 108=l, 109=m, 105=i, 117=u, 116=t, 115=s, 114=r, 112=p, 122=z,
121=y}
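
For reference, here is a minimal sketch of how I understand the
composite map to be built, plus a check that lists the codes mapped
to control characters. This is not PDFBox code: the class name
CodeMapCheck, both method names, and the glyph name in main() are
hypothetical, and it only assumes that the two inputs are plain
code-to-name and name-to-character maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox or FontBox.
class CodeMapCheck {

    // Compose code->name and name->character into code->character.
    // Codes whose glyph name has no character mapping are skipped.
    static Map<Integer, String> compose(Map<Integer, String> codeToName,
                                        Map<String, String> nameToCharacter) {
        Map<Integer, String> result = new LinkedHashMap<Integer, String>();
        for (Map.Entry<Integer, String> entry : codeToName.entrySet()) {
            String character = nameToCharacter.get(entry.getValue());
            if (character != null) {
                result.put(entry.getKey(), character);
            }
        }
        return result;
    }

    // Return the codes whose mapped string contains an ISO control character.
    static List<Integer> suspectCodes(Map<Integer, String> codeToCharacter) {
        List<Integer> suspects = new ArrayList<Integer>();
        for (Map.Entry<Integer, String> entry : codeToCharacter.entrySet()) {
            String value = entry.getValue();
            for (int i = 0; i < value.length(); i++) {
                if (Character.isISOControl(value.charAt(i))) {
                    suspects.add(entry.getKey());
                    break;
                }
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Made-up sample data that imitates the shape of the dump above.
        Map<Integer, String> codeToName = new LinkedHashMap<Integer, String>();
        codeToName.put(1, "hypotheticalGlyph");
        codeToName.put(97, "a");
        codeToName.put(98, "b");

        Map<String, String> nameToCharacter = new LinkedHashMap<String, String>();
        nameToCharacter.put("hypotheticalGlyph", String.valueOf((char) 1));
        nameToCharacter.put("a", "a");
        nameToCharacter.put("b", "b");

        Map<Integer, String> codeToCharacter = compose(codeToName, nameToCharacter);
        System.out.println("Suspect codes: " + suspectCodes(codeToCharacter));
    }
}
```

Running suspectCodes() over the real codeToCharacter map would show
exactly which codes end up as control characters, which should help
narrow down whether the code-to-name or the name-to-character half is
at fault.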

The font is used to render the text of a multi-page article, certainly
containing more of the alphabet than those letters represented above
(definitely has a 'b' in it).

I'm using version 1.1.0 of PDFBox and FontBox.

I suppose this could be a problem in CFFParser or CFFFont, or in the
code-to-name mapping overrides applied by PDType1CFont#loadOverride().

Has anyone come across a similar problem?

Regards,

Karl
