Pierre MacKay wrote:
I am following this discussion with great interest, but I wonder whether the problems of using a font with the Adobe Expert Character set names have been looked at.

Adobe seems (it is difficult to be sure of the causes) to have set up Acrobat Reader 8 and 9 so that they trap names like Asmall . . . Zsmall, the old-style figures, and the ff ligatures. Unless I use the online distiller at Acrobat.com, I get PDFs in which all characters from the Expert character set are replaced by blank space.

*What do you mean by* "replaced by blank space", exactly? Do the characters fail to show up when you view the document, are they missing when you print it, or are they missing when you copy text from it?

Actually, not all, because the accented glyphs in the range E0--FF come through.

It is, of course, possible to bypass the problem by using something other than Reader 8 or 9.

/me typically uses Reader 5 (unless the document has compressed object streams), because the GUI quality seems (IMHO) to be a decreasing function of version number. ;-)

Reader 6 and 7 did not have the problem, so it is something introduced by Adobe in the later versions of Reader. I submitted a bug report about the problem when Reader 8 came out. It was acknowledged, and I was told that it would be corrected "in the next major release." It clearly has not been. One of the worst aspects of this bug is that it destroys the archival value of all PDFs distilled before the arrival of Reader 8. (I don't know exactly when the change was made in Acrobat Distiller, but I suspect it was contemporaneous with Reader 8.)

A comparison of output from the online distiller at Adobe.com with output from Ghostscript 8.63 shows that in the Adobe distiller, any font with the names Asmall . . . Zsmall is subjected to two consecutive operations, the first of which is associated with "/ToUnicode". I have been unable to find out what /ToUnicode does. Does it recode the entire Adobe Expert Character set into a page in the Private Use Area?

If the difference involves /ToUnicode, then it should only be Copy text and Search operations that misbehave, right? (IMO, that wouldn't destroy the archival value of PDFs, but neither would bugs specific to one PDF reader.)

FYI, the /ToUnicode entry in a PDF font dictionary sets up a mapping from slots in the font to Unicode code points; the PDF 1.5 spec describes this in Section 5.9, "Extraction of Text Content". Providing such a map explicitly is really the only general way to assign an interpretation to the text in a PDF, but originally Acrobat Reader also had heuristics for guessing an interpretation from the glyph names. It is possible that the change you observed in AR8 was merely the retirement of some of these heuristics, so that "Asmall" is no longer on the list of known names, even though "a" might still be.
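To make the mechanism concrete: a /ToUnicode CMap is a small PostScript-syntax stream of bfchar/bfrange mappings from font slots to UTF-16BE code points. A minimal sketch (the slot numbers and CMap name here are invented for illustration) mapping an "Asmall" slot to U+0041 and an ff-ligature slot to the two code points f f could look like:

```postscript
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Example-ToUnicode def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<41> <0041>     % slot 0x41 ("Asmall") -> U+0041 LATIN CAPITAL LETTER A
<FB> <00660066> % ff ligature slot -> "f" "f" (two code points)
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```

With such a map in place, Copy and Search should work regardless of whether the reader recognizes the glyph names.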

Fontinst has had the ability to generate /ToUnicode CMaps since v1.928 (or thereabouts), through the \etxtocmap command. Getting PDF generators to insert the CMap in the right place is, however, not so straightforward: pdfTeX only gives such access to font dictionaries from the TeX side (whereas access via the map file would be more useful), and it only works for fonts that have been \font'defed (hence not for base fonts of virtual fonts). OTOH, recent pdfTeXes seem to have some built-in heuristics of their own for generating ToUnicode data; I haven't studied those in detail. Nor do I know what gs or dvipdfmx can currently do in this respect.
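For concreteness, the TeX-side access in pdfTeX goes through \pdfobj and \pdffontattr; a sketch (the font and CMap file names are made up) would be:

```tex
% Embed the CMap file as a PDF stream object, then point the
% font dictionary's /ToUnicode entry at that object.
\immediate\pdfobj stream file {myfont.cmap}
\font\myfont=myfont at 10pt
\pdffontattr\myfont{/ToUnicode \the\pdflastobj\space 0 R}
```

As said, this only reaches fonts selected with \font on the TeX side, not the base fonts underneath a virtual font.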

There is also the possibility of putting /ActualText data directly into the page content stream by using pdf: \specials. I've recently considered adding support for this to fontinst (the specials would be embedded into the VF; I have figured out how to do it elegantly), but that's probably only appropriate for faked glyphs (e.g. Euro from C and two rules). See also the accsupp LaTeX package.
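For the record, an /ActualText wrapper amounts to a marked-content span in the page content stream; in pdfTeX terms, a rough sketch (using the faked Euro as the example, U+20AC in UTF-16BE with BOM) is:

```tex
% Open a marked-content span whose /ActualText gives the
% intended character; close it again after the faked glyph.
\pdfliteral page {/Span << /ActualText <FEFF20AC> >> BDC}
% ... typeset the faked glyph (e.g. C plus two rules) here ...
\pdfliteral page {EMC}
```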

Lars Hellström
