Pierre MacKay wrote:
I am following this discussion with great interest, but I wonder whether
the problems of using a font with the Adobe Expert Character set names
have been looked at.
Adobe seems (it is difficult to be sure of the causes) to have set up
Acrobat Reader 8 and 9 so that they trap names like Asmall . . .
Zsmall, the old-style figures and the ff ligatures. Unless I use the
on-line distiller at Acrobat.com, I get PDFs in which all characters
from the Expert character set are replaced by blank space.
*What do you mean by* "are replaced by blank space", exactly? Don't
they show up when you view the document, are they missing when you
print the document, or are they missing when you copy text from the
document?
Actually,
not all, because the accented glyphs in the range E0--FF come through.
It is, of course, possible to bypass the problem by using something
other than Reader 8 or 9.
/me typically uses Reader 5 (unless the document has compressed object
streams), because the GUI quality seems (IMHO) to be a decreasing
function of version number. ;-)
Reader 6 and 7 did not have the problem, so
it is something introduced by Adobe in the later versions of Reader. I
submitted a bug report about the problem when Reader 8 came out. It was
acknowledged, and I was told that it would be corrected "in the next
major release." It clearly has not been corrected. One of the worst
aspects of this bug is that it destroys the archival value of all PDFs
distilled before the arrival of Reader 8. (I don't know exactly when
the change was made in Acrobat Distiller, but I suspect that it was
contemporaneous with Reader 8).
A comparison of output from the online distiller at Adobe.com and output
from Ghostscript 8.63 shows that in the Adobe distiller, any font with
the names Asmall . . . Zsmall is treated to two consecutive
operations, the first of which is associated with "/ToUnicode". I have
been unable to find out what /ToUnicode does. Does it recode the entire
Adobe Expert Character set into a page in the Private Use Area?
If the difference involves /ToUnicode, then it should only be Copy text
and Search operations that misbehave, right? (IMO, that wouldn't
destroy the archival value of PDFs, but nor would bugs specific to one
PDF reader.)
FYI, the /ToUnicode entry in a PDF font dictionary sets up a mapping
from slots in the font to Unicode code points; the PDF 1.5 spec
describes this in Section 5.9, "Extraction of Text Content". Providing
such a map explicitly is really the only general way to assign an
interpretation to the text in a PDF, but originally Acrobat Reader also
had heuristics for guessing an interpretation from the glyph names. It
is possible that the change in AR8 you observed was merely a retirement
of some of these heuristics, so that "Asmall" is no longer on the list
of known names, even though "a" might still be.
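For concreteness, a minimal ToUnicode CMap in bfchar form looks roughly like the following (the slot values and the two mappings are invented for illustration; a real CMap for an Expert-encoded font would of course cover the whole encoding):

```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Example-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<0B> <00660066>  % hypothetical slot of the ff ligature -> "ff" (U+0066 U+0066)
<41> <0041>      % hypothetical slot of Asmall -> "A" (one plausible choice)
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```

Note that a single slot can map to several code points (as for the ff ligature above), which is precisely what glyph-name heuristics handle poorly.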
Fontinst has had the ability to generate /ToUnicode CMaps since v1.928
(or thereabout), through the \etxtocmap command. Getting PDF generators
to put it in at the right place is however not so straightforward;
pdfTeX only gives such access to font dictionaries from the TeX side
(whereas the mapfile would be more useful) and it only works for fonts
that have been \font'defed (hence not for base fonts of virtual fonts).
OTOH, recent pdfTeXes seem to have some built-in heuristics of their
own for generating ToUnicode data; I haven't studied those in detail.
Nor do I know what gs or dvipdfmx can currently do in this respect.
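The TeX-side hookup in pdfTeX goes roughly like this (a sketch only; myfont.cmap is an assumed file holding a complete ToUnicode CMap stream, and the font name is made up). It also illustrates the limitation above: \pdffontattr takes a \font'ed font, so base fonts of virtual fonts are out of reach.

```latex
% Sketch: attach a ToUnicode CMap to a \font'ed font in pdfTeX.
\immediate\pdfobj stream file {myfont.cmap}% CMap as a PDF stream object
\font\myfont=myfontexpert \myfont
\pdffontattr\myfont{/ToUnicode \the\pdflastobj\space 0 R}
```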
There is also the possibility of putting /ActualText data directly into
the page content stream by using pdf: \specials. I've recently
considered adding support for this to fontinst (the specials would be
embedded into the VF; I have figured out how to do it elegantly), but
that's probably only appropriate for faked glyphs (e.g. Euro from C and
two rules). See also the accsupp LaTeX package.
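As an illustration of the /ActualText route at the LaTeX level, the accsupp package wraps material so that text extraction sees the replacement text instead of the glyphs; a sketch (the C-plus-rule "Euro" here is only a crude stand-in for a properly faked glyph):

```latex
\documentclass{article}
\usepackage{accsupp}% provides \BeginAccSupp / \EndAccSupp
\begin{document}
% Copy-paste should yield U+20AC although the glyph is faked.
\BeginAccSupp{method=hex,unicode,ActualText=20AC}%
C\kern-0.55em\raisebox{0.25ex}{\rule{0.45em}{0.06em}}% crude faked Euro
\EndAccSupp{}
\end{document}
```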
Lars Hellström