On Wed, Jan 01, 2014 at 10:07:54PM +1100, Ross Moore wrote:
> > ToUnicode supports one byte to many bytes, not many bytes
> > to many bytes.
>
> Exactly. This is why /ActualText is the structure to use.

My only issue with /ActualText is that using it to tag whole words breaks fine-grained text selection: one cannot select individual characters inside these words, and searching for a single character highlights the whole word containing it. Otherwise it is the most versatile mechanism for preserving the original text in PDF files.

Because of that, I think a better strategy is to use the /ToUnicode mapping wherever applicable and to resort to /ActualText for the problematic cases, namely one-to-many substitutions, reordering, and different substitutions leading to the same glyph (though the last one can be handled by duplicating the glyph under a different name/encoding when subsetting the font).

The situation in XeTeX is more complex because the typesetting (where the original text string is known) is done in XeTeX, while the PDF generation is done by the PDF driver, and the communication channel between the two (XDV files) passes only glyph IDs, not the original text strings. So we can only rely on font encodings and glyph names (or try to guess glyph names by examining simple font substitutions in the upcoming patch).

Regards,
Khaled

--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex
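
To make the strategy above a bit more concrete (/ToUnicode where possible, /ActualText only for the problematic cases), here is a minimal Python sketch. It is not XeTeX or driver code. The `Cluster` type and both function names are hypothetical, and it assumes the original text of each cluster is available, which is exactly what the XDV channel does not provide. It also assumes that one-to-many substitutions and reordering both show up as clusters with more than one glyph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical shaping result: original text plus the glyph IDs it produced."""
    text: str
    glyphs: tuple


def plan_text_recovery(clusters):
    """Split clusters into a ToUnicode map and a list needing /ActualText.

    A glyph gets a ToUnicode entry only if it always stands alone and always
    comes from the same text (one glyph to one or more characters).
    """
    candidates = {}
    for c in clusters:
        if len(c.glyphs) == 1:
            candidates.setdefault(c.glyphs[0], set()).add(c.text)

    # Glyphs reached from different source texts are ambiguous, so skip them.
    to_unicode = {g: next(iter(texts))
                  for g, texts in candidates.items() if len(texts) == 1}

    # Everything ToUnicode cannot express falls back to /ActualText.
    needs_actual_text = [
        c for c in clusters
        if not (len(c.glyphs) == 1 and c.glyphs[0] in to_unicode)
    ]
    return to_unicode, needs_actual_text


def actual_text_span(text, show_ops):
    """Wrap text-showing operators in a marked-content span with /ActualText.

    The text is written as a UTF-16BE hex string with a leading byte order mark.
    """
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC {show_ops} EMC"
```

With this split, only the clusters returned in the second list are wrapped in a span, so ordinary characters stay individually selectable and searchable. The ambiguous-glyph case could instead be handled by duplicating the glyph when subsetting, as mentioned above; the sketch simply falls back to /ActualText for it.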