My reply is interleaved below, but there's something important to cover before reading on.

There's clearly a difference between what I mean by de-duplication and what you think I mean by de-duplication. As far as I can tell you're looking at font substitution and un/re-embedding, where (e.g.) Helvetica LT Std is replaced with Helvetica Neue Sans, a different version of Helvetica LT Std, the built-in Helvetica derived from Adobe's multi-master fonts, or whatever. The replacement font might not have matching metrics and certainly wouldn't be identical.

That's *not* what I'm talking about. I'm talking about the case where multiple embedded subsets derived from the *exact* *same* *font* exist, each containing partially overlapping sets of glyphs where each glyph is *identical* to those in the other subsets.

This is best illustrated by example. Take three input PDFs that are being placed as images (say, engineering diagrams, advertisements or breakouts in a layout, or whatever), named "1.pdf", "2.pdf" and "3.pdf", that will be written into "out.pdf". For the sake of this example, presume that content in "out.pdf" uses "Arial Regular" for its own text, so that font must also be embedded.

1.pdf:
       Helvetica Neue Sans subset [a cde  h]
       Utopia Black               [abcd]
2.pdf:
       Helvetica Neue Sans subset [abcde   ]
       Helvetica LT Std           [ab def  ijk]
3.pdf:
       Helvetica Neue Sans subset [  c efgh]

Desired output is:

out.pdf:
       Helvetica Neue Sans subset [abcdefgh]
       Utopia Black               [abcd]
       Helvetica LT Std           [ab def  ijk]
       Arial Regular              (whatever the text in out.pdf requires)

Fop and fop-pdf-image currently produce:

out.pdf:
       Helvetica Neue Sans subset [a cde  h]
       Helvetica Neue Sans subset [abcde   ]
       Helvetica Neue Sans subset [  c efgh]
       Utopia Black               [abcd]
       Helvetica LT Std           [ab def  ijk]
       Arial Regular              (whatever the text in out.pdf requires)

... meaning that there are 3 copies of h.n.s "c" and "e", plus 2 copies of "a", "d" and "h", from *identical* fonts (presuming each input had the same version of h.n.s, as verified by metrics or, for the truly paranoid, even glyph data checksums). You appear to think I want to produce:

out.pdf:
       Helvetica Neue Sans        [abcdefghijk]
       Utopia Black               [abcd]
       Arial Regular              (whatever the text in out.pdf requires)

or even:

out.pdf:
       Arial Regular              (out.pdf glyph usage plus [abcdefghijk])
       Utopia Black               [abcd]

... where Helvetica Neue Sans and Helvetica LT Std are "de-duplicated" despite not being true duplicates of each other, or, in the latter case, both are replaced with the approximately "equivalent" Arial Regular.

That is *not* what I want; that would be completely incorrect to do automatically.
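
To pin down what I *do* mean, here's a minimal sketch in Java; the names (SubsetMerger, the string font key) are made up for illustration and this is not fop or fop-pdf-image code. The whole operation is just "group embedded subsets by an identity check, then take the union of their glyph sets":

import java.util.*;

/*
 * Hypothetical sketch only. The "fontKey" string stands in for whatever
 * identity check is used (PostScript name plus a metrics or glyph-data
 * checksum); only subsets whose keys are equal ever get merged.
 */
public class SubsetMerger {

    private final Map<String, SortedSet<Integer>> glyphsByFont =
            new HashMap<String, SortedSet<Integer>>();

    /** Record one embedded subset found in an input PDF. */
    public void addSubset(String fontKey, Collection<Integer> glyphIds) {
        SortedSet<Integer> glyphs = glyphsByFont.get(fontKey);
        if (glyphs == null) {
            glyphs = new TreeSet<Integer>();
            glyphsByFont.put(fontKey, glyphs);
        }
        glyphs.addAll(glyphIds);
    }

    /** Union of every glyph seen for this font: the one subset to embed. */
    public SortedSet<Integer> mergedGlyphs(String fontKey) {
        SortedSet<Integer> glyphs = glyphsByFont.get(fontKey);
        return glyphs != null ? glyphs : new TreeSet<Integer>();
    }
}

Run the 1.pdf/2.pdf/3.pdf example through something like that and you end up with one Helvetica Neue Sans subset covering [abcdefgh], while Helvetica LT Std stays separate because its key never matches anything else.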


On 03/06/2012 07:08 PM, mehdi houshmand wrote:
> Font de-duping is intrinsically a post-process action, you need the
> full document, with all fonts, before you can do any font de-duping.
> PostScript does this very thing (to a much lesser extent) with the
> <optimize-resources> tag, as a post-process action.

I absolutely disagree that font optimization must be done in a second pass.

Font de-duplication requires knowledge of all the fonts in the document, yes. That doesn't make it necessarily a post-process operation. PDF is a wonderfully non-linear format, and it's trivial to delay writing out fonts until the end of the document. PDF simply doesn't care where the fonts appear in the document. Once you know the last content stream has been written out (say, just before you write the xref tables) you know no more new glyphs will be used and no new fonts will be referenced, so you can write out the fonts you need.
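
As a rough illustration of that ordering - invented names again, not fop's actual PDF writer API, and reusing the hypothetical SubsetMerger from earlier - the writer can hand out indirect object numbers for fonts as soon as content references them, and only serialise the font objects once the last content stream has gone out:

import java.util.*;

/*
 * Rough sketch, not fop's real writer. The point is that a font can be
 * *referenced* by indirect object number long before its bytes are written;
 * the font objects themselves are flushed just before the xref table.
 */
public class DeferredFontWriter {

    // In a real writer this counter would be shared with the rest of the
    // document's object numbering.
    private int nextObjectNumber = 1;

    private final Map<String, Integer> fontObjectNumbers =
            new LinkedHashMap<String, Integer>();

    /** Reserve an object number for a font; nothing is written yet. */
    public int referenceFont(String fontKey) {
        Integer objNum = fontObjectNumbers.get(fontKey);
        if (objNum == null) {
            objNum = Integer.valueOf(nextObjectNumber++);
            fontObjectNumbers.put(fontKey, objNum);
        }
        return objNum.intValue();
    }

    /** Called after the last content stream, just before the xref table. */
    public void flushFonts(SubsetMerger merger) {
        for (Map.Entry<String, Integer> e : fontObjectNumbers.entrySet()) {
            SortedSet<Integer> glyphs = merger.mergedGlyphs(e.getKey());
            // Here a real writer would build the merged subset from
            // "glyphs" and emit it as indirect object e.getValue().
        }
    }
}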

The only operation in PDF that is (almost) forced to be post-process is writing out linearized ("fast web view" or "web optimized") PDF. That's because web-optimized PDF must have a partial xref table and the trailer dictionary near the *start* of the file. It's actually still possible to create linearized PDF by streaming it out in a single pass, but you need to know more in advance about what you'll be writing out, so in practice it's much simpler to linearize by post-processing.

> Also, the requirements aren't clear here, what is it we want here? Let
> me validate that, this shouldn't change the (I guess we can call it)
> "canonical" PDF document. By that I mean if you rasterized a PDF
> before and after this change they should be identical,
> pixel-for-pixel.

I agree.
> When Acrobat does the font de-duping (I don't
> remember how much control it gives you, but if there are levels of
> de-duping I would have chosen the most aggressive), the documents
> aren't identical.

That's because it's actually substituting fonts, replacing one font with another with non-identical metrics. That's not what I want to do; I want to *merge* overlapping subsets of fonts with identical metrics. Since the font dictionary gives the metric information, it's practical to do this. If fonts don't have the same metrics, you don't de-dupe them, because they're not duplicates.

"Optimizing" a PDF by substituting one font for another is a completely different and much bigger job. Replacement of one font with another non-identical font is a different job that may require rewriting of content streams (for encoding differences), the production of multiple font dictionaries with different encodings to remap different content streams to use one font file, etc. It's hairy and complicated and I don't want to go there.

> There are aberrations caused by slight kerning
> differences between various versions of Arial. This may seem trivial
> when compared to bloated PDFs, but it looks tacky and lowers the high
> standard of documents.

If the metrics don't match, they're not the same font and they don't get merged. The glyph metrics in the font dictionary should be sufficient to handle this.
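
For what "same metrics" could look like in code, here's a hedged sketch; the class and field names are mine, but everything compared is data the PDF already carries in the font descriptor and width entries:

import java.util.Map;

/*
 * Illustration only: decide whether two embedded fonts are "the same font",
 * and therefore mergeable, using values the PDF already carries. Anything
 * that differs means no merge.
 */
public class FontIdentity {

    private final String baseFontName;           // e.g. "HelveticaNeueLTStd-Roman"
    private final float ascent, descent, capHeight, italicAngle;
    private final Map<Integer, Integer> widths;   // glyph/code -> advance width

    public FontIdentity(String baseFontName, float ascent, float descent,
                        float capHeight, float italicAngle,
                        Map<Integer, Integer> widths) {
        this.baseFontName = baseFontName;
        this.ascent = ascent;
        this.descent = descent;
        this.capHeight = capHeight;
        this.italicAngle = italicAngle;
        this.widths = widths;
    }

    /**
     * Mergeable only if the descriptor values match and every glyph the two
     * subsets have in common has the same advance width.
     */
    public boolean matches(FontIdentity other) {
        if (!baseFontName.equals(other.baseFontName)
                || ascent != other.ascent || descent != other.descent
                || capHeight != other.capHeight
                || italicAngle != other.italicAngle) {
            return false;
        }
        for (Map.Entry<Integer, Integer> e : widths.entrySet()) {
            Integer w = other.widths.get(e.getKey());
            if (w != null && !w.equals(e.getValue())) {
                return false;
            }
        }
        return true;
    }
}

Widths are compared as a map rather than a flat array because two subsets only need to agree on the glyphs they have in common.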

Having three partial subsets of Arial in a document, each slightly different versions with slightly different metrics, is something I can live with. The problem arises when you have 10 different mostly-overlapping subsets of the *exact* *same* *glyph* *data* from each of those, leaving you with *30* small-ish copies of Arial instead of 3 slightly larger ones.

> The other issue is you have subset fonts created by FOP as well as
> those imported by the pdf-image-plugin. You'd have to create some
> bridge between the image loading framework and the font loading system
> *cough* HACK *cough*.

Only if you want to handle de-dupe between fop-loaded fonts and fonts loaded from input PDFs. I don't think that's particularly vital, but it might not be as bad as you think either.

The font matching and subset merging system required for pdf-image to de-dupe fonts would have to track glyph metrics, font names, etc. for every font seen, and would need to accumulate information on needed glyphs until the end of output generation, just before the xref is written. Fop already has to maintain used-glyph information, and already knows glyph metrics, so it's entirely practical for it to report that into the same system. From there, it's not too much of a stretch to see pdf-image recognising that fop is already going to embed a font with the same name and metrics and just merging its required-glyph list with fop's before fop generates the subset.
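
Again purely as a sketch - neither fop nor fop-pdf-image has anything like this today - the "same system" could be little more than a registry both sides report into, keyed on the kind of FontIdentity check sketched earlier. pdf-image would register the glyphs each imported subset carries, fop would register the glyphs its own text uses, and the merged sets would be turned into subsets just before the xref:

import java.util.*;

/*
 * Sketch only. A single accumulation point for glyph usage, shared by fop's
 * own text and by fop-pdf-image's imported subsets. Fonts whose identities
 * match are collapsed into one entry.
 */
public class GlyphUsageRegistry {

    private final Map<FontIdentity, SortedSet<Integer>> usage =
            new LinkedHashMap<FontIdentity, SortedSet<Integer>>();

    /** Find a previously registered font with identical metrics, if any. */
    private FontIdentity findMatch(FontIdentity font) {
        for (FontIdentity candidate : usage.keySet()) {
            if (candidate.matches(font)) {
                return candidate;
            }
        }
        return null;
    }

    /** Called by fop for its own text and by pdf-image for imported subsets. */
    public void registerGlyphs(FontIdentity font, Collection<Integer> glyphIds) {
        FontIdentity key = findMatch(font);
        if (key == null) {
            key = font;
            usage.put(key, new TreeSet<Integer>());
        }
        usage.get(key).addAll(glyphIds);
    }

    /** One merged glyph set per distinct font, ready for subset generation. */
    public Map<FontIdentity, SortedSet<Integer>> mergedUsage() {
        return Collections.unmodifiableMap(usage);
    }
}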

That's a significantly bigger project, though. Just being able to merge completely redundant glyph subsets where the glyph data and metrics are exactly identical between partially overlapping subsets being loaded by fop-pdf-image would be a nice start.

The best thing about all this is that it's practical to do it progressively.

> Alternatively, just thinking aloud here, if this
> was done as a post-process *wink* *wink* *wry smile*...

While it can be done in post-process, I'm really not convinced it's necessary. FOP handles image scaling and resampling - why don't we do that in post-process, too? Just generate a monstrously huge PDF full of uncompressed images, then re-sample later?

The answer seems to be because it's practical to do it in one pass, it's nicer for users, and it works well.

Why does fop have font subsetting support? Subsetting can be done in post-process: all you have to do is read the content streams and determine which glyphs are used, then rewrite the font. It's done in a single pass because it's *much* easier to implement that way, when fop already knows the glyphs it has used. Same deal: it could be done in a post pass, but it isn't, because it doesn't make sense to do so.

Font replacement and the substitution of non-identical fonts should be done in post, because it's not practical to do them in a way that's going to be easy, reliable and automatic, nor are there any obvious correct choices. We don't know if the document designer wants to replace their own copy of Helvetica with Adobe's multi-master version. On the other hand, it's pretty bloody obvious that the user won't want 100 copies each of "abcdefg...." glyphs from Helvetica LT Std that are *exactly* *the* *same* when they can have just one copy of each with no effect on document display.

> Apologies if I may seem to be argumentative here, it's not my
> intention, but I feel this would be serious scope creep. I see the
> pdf-image-plugin as a plugin that treats PDFs as images, nothing more.
> If you want to stitch together PDFs, PDFBox is designed just for that.

The trouble is that fop-pdf-image exists because PDFs aren't just images. If they were, it'd be much easier to just rasterise them and import them in raster form.

FWIW, I'm not trying to use fop to "stitch together PDFs" - not in the sense of trying to use it to append, n-up, impose, etc. complex PDF documents. I'm using small PDFs that are basically "images" - but represented as a combination of vector, text and raster data that should be included in the output document as efficiently as possible and without loss of fidelity. IOW, exactly what fop-pdf-image is for.

--
Craig Ringer
