Hi,

> Joel Hirsh <[email protected]> wrote on 4 May 2014 at 21:03:
>
>
> I am using PDFTextStripper and getting odd results on some strings that I
> tracked down to something that I think may be a bug in PDFStreamEngine.
>
> The PDF file has some text that looks like "1234" in Acrobat, but comes
> through as "1 2 3 4" from PDFTextStripper. The logic in PDFTextStripper is
> putting in spaces because of a large inter-character spacing.
>
> Tracing it down, the PDF file has a 'Tc' (spacing operator) followed by a
> 'Tm' (matrix operator) with a scale of 8. Other PDF files that I could
> find with 'Tc' operators had the 'Tc' after the matrix operator.

Both operators are optional, so their usage may differ completely from one
PDF to another.
> What strikes me as incorrect is that PDFStreamEngine does not distinguish
> between a 'Tc' followed by 'Tm' versus a 'Tm' followed by 'Tc' . In either
> case the spacing in the 'Tc' is multiplied by the scale factor in the
> matrix. There is nothing in the Adobe PDF spec that specifically
> addresses order of transforms, but in normal mathematics there is big
> difference. And in the case that looks incorrect, the spacing is being
> multiplied by the scale in the matrix, and the results would be more like
> Acrobat if it didn't.

I guess there is a misunderstanding. Neither operator performs any
calculation; each one just sets or replaces a value. Other operators such as
'Tj' use those values in their calculations, so the order of 'Tc' and 'Tm'
isn't relevant. Furthermore, in your case it's a simple scaling by scalar
values, which is a commutative operation, so the order of the operands
doesn't matter.

> Can someone who might have more knowledge about PDFStreamEngine/
> PDFTextStripper comment on this? The code that does the multiply is in
> PDFStreamEngine.processEncodedText when it is operating on the value in
> characterSpacingText.

Can you share the PDF with us, so that we can have a look to see what might
be wrong?

> Thanks

BR
Andreas Lehmkühler
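
To illustrate the point about 'Tc' and 'Tm' only storing values, here is a
minimal toy model in Java. It is not the PDFBox code: the class
TextStateSketch, its methods, and the spacing value 0.5 are made up for
illustration, and the matrix is reduced to a single uniform scale factor (8,
as in the file described above). It only shows that when both operators
merely set state, the value computed later at 'Tj' time is the same for
either order.

```java
public class TextStateSketch {

    // Hypothetical, heavily simplified text state (not a PDFBox class).
    public static final class TextState {
        double charSpacing = 0.0;  // value stored by 'Tc'
        double matrixScale = 1.0;  // uniform scale taken from the 'Tm' matrix
    }

    // 'Tc' only replaces the stored character spacing.
    public static void tc(TextState state, double spacing) {
        state.charSpacing = spacing;
    }

    // 'Tm' only replaces the stored matrix (here reduced to its scale).
    public static void tm(TextState state, double scale) {
        state.matrixScale = scale;
    }

    // The multiplication happens only when text is shown ('Tj'),
    // using whatever values are stored at that moment.
    public static double scaledSpacing(TextState state) {
        return state.charSpacing * state.matrixScale;
    }

    public static void main(String[] args) {
        TextState tcFirst = new TextState();
        tc(tcFirst, 0.5);   // assumed spacing value, for illustration only
        tm(tcFirst, 8.0);

        TextState tmFirst = new TextState();
        tm(tmFirst, 8.0);
        tc(tmFirst, 0.5);

        // prints "4.0 4.0"
        System.out.println(scaledSpacing(tcFirst) + " " + scaledSpacing(tmFirst));
    }
}
```

If the extracted spacing is still too large for a given file, the cause is
therefore more likely in the values themselves or in how they are used later
than in the order of the two operators.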

