Re: Embedded PDF font width correction

Shawn A Sun, 08 Feb 2009 11:47:09 -0800

> > I am analyzing and modifying PDF text using PDFBox and regular expressions.
> > Every PDF that needs to be analyzed comes from Microsoft Word. Therefore
> > they contain embedded fonts. When I analyze the text and then replace it, I
> > get text running together like this:
> >
> > http://criminy.webfactional.com/media/images/PDFError02/a_zA_Z0_9_symbols.png
> >
> > Where it should be: akbcdef...pqr...za....@$^...
> >
> > What I've noticed is that MS Word writes its embedded fonts with width
> > values of 0 for some of the letters; which letters are affected depends on
> > the font and the version of MS Word used. I'm able to fix this by running:
> >
> > font.getWidths().set('K' - 32, new COSFloat(690.0f));
> >
> > for each offending letter (usually the letters with a width of 0). Now
> > I am trying to determine the best way to compute the widths of these letters,
> > as I would like to apply a general-case font width correction rather than
> > hope that MS Word's PDF generation doesn't mess up the widths any more than
> > it currently does.
> Is this problem independent from the type of font, e.g. TrueType, Type1,
> OpenType etc.?


The PDFs I have had the chance to work with only use embedded TrueType
fonts, so I haven't seen this on any others.
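
To make the "general case" idea from the quoted text concrete, here is a minimal sketch that works on a plain copy of the widths list rather than on PDFBox objects. The class and method names (WidthRepair, fillZeroWidths, widthIndex) are made up for illustration, and the fallback of using the mean of the non-zero widths is only an assumption; it is a crude guess, not the real advance width of the glyph. Some glyphs (combining marks, for example) can legitimately have a width of 0, so a blanket replacement may be wrong for them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WidthRepair {

    /**
     * Returns a copy of the widths in which every entry that is not
     * positive is replaced by the mean of the positive entries.
     * The mean is an assumed stand-in, not a measured glyph width.
     */
    static List<Float> fillZeroWidths(List<Float> widths) {
        float sum = 0f;
        int count = 0;
        for (float w : widths) {
            if (w > 0f) {
                sum += w;
                count++;
            }
        }
        // If every width is zero there is nothing to average, so keep 0.
        float fallback = count > 0 ? sum / count : 0f;

        List<Float> fixed = new ArrayList<>(widths.size());
        for (float w : widths) {
            fixed.add(w > 0f ? w : fallback);
        }
        return fixed;
    }

    /** Index into the widths list for a character code, given the font's first char code. */
    static int widthIndex(char c, int firstChar) {
        return c - firstChar;
    }

    public static void main(String[] args) {
        List<Float> widths = Arrays.asList(250f, 0f, 750f);
        System.out.println(fillZeroWidths(widths));
        // With a first char code of 32, as in the one-liner quoted above:
        System.out.println(widthIndex('K', 32));
    }
}
```

The repaired values would still have to be written back into the font's widths list, for example wrapped in COSFloat as in the one-liner quoted above.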

> > The worst case scenario, I think, is that I can render each letter, crop it,
> > and take its pixel width, and then convert the pixel width to the text
> > space width. That hardly seems ideal, though. I also do not think that the
> > width of a character is guaranteed to be the same in two different fonts;
> > otherwise a properties file listing the text space widths would be the easy
> > solution.
> What version of pdfbox do you use?

0.7.4. This is the latest I could get my hands on; I grabbed it
from the JBoss Maven repository.
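
For the render-and-crop fallback mentioned in the quoted text, the conversion from pixels to width units could look roughly like the sketch below. It assumes the usual convention for simple (non-Type 3) fonts that widths are given in thousandths of a text space unit, and that the glyph was rendered at a known point size and resolution. The names PixelToGlyphWidth and toGlyphUnits are hypothetical. Note also that a cropped image measures only the inked part of the glyph, so it would probably underestimate the advance width, which includes the side bearings.

```java
public class PixelToGlyphWidth {

    /**
     * Converts a measured pixel width to glyph space units (1000 per em).
     *
     * @param pixelWidth measured width of the rendered glyph, in pixels
     * @param fontSizePt font size used for rendering, in points
     * @param dpi        resolution used for rendering, in dots per inch
     */
    static double toGlyphUnits(double pixelWidth, double fontSizePt, double dpi) {
        // One point is 1/72 inch, so this is the number of pixels per em.
        double pixelsPerEm = fontSizePt * dpi / 72.0;
        return pixelWidth * 1000.0 / pixelsPerEm;
    }

    public static void main(String[] args) {
        // A glyph rendered at 100 pt and 72 dpi that measures 69 pixels wide.
        System.out.println(toGlyphUnits(69, 100, 72));
    }
}
```

Rendering at a large point size should reduce the rounding error, since each pixel then corresponds to fewer width units.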
