On 1/1/14 11:49, Khaled Hosny wrote:
The situation in XeTeX is more complex because the typesetting (where the original text string is known) is done in XeTeX, while the PDF generation is done by the PDF driver, and the communication channel between the two (XDV files) passes only glyph IDs, not the original text strings.
I'd suggest that the best way forward here would be to modify XeTeX so that it includes the original Unicode text in the XDV stream, as well as the positioned glyphs. Then the driver can write a correct ActualText for each word.
There'd be some performance cost to this, of course; the inclusion of the Unicode text could be an optional feature, so that people who just want a "throwaway" pdf in order to print a document don't have to suffer slower generation and/or larger files.
This wouldn't address all the problems with pdf text extraction; higher-level issues of text structure and flow would still be tricky in the case of documents with any complex layout. But at least the basic Unicode characters making up each word would be reliably correct.
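To make the ActualText idea concrete, here is a rough sketch of what the driver could emit for one word once it has both the Unicode text and the glyph IDs. It uses the standard PDF marked-content operators (/Span with an /ActualText entry, BDC ... EMC). The function names are made up for illustration, the font is assumed to use 2-byte glyph codes (as with Identity-H), and nothing here reflects how the existing driver or the XDV format actually works.

```python
def actual_text_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE with a leading byte-order mark."""
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


def show_word(text: str, glyph_ids: list[int]) -> str:
    """Wrap a glyph run in a marked-content span carrying the original text.

    Hypothetical helper: assumes 2-byte glyph codes (e.g. Identity-H encoding).
    """
    glyphs = "".join(f"{gid:04X}" for gid in glyph_ids)
    return f"/Span << /ActualText {actual_text_hex(text)} >> BDC <{glyphs}> Tj EMC"


# Hypothetical example: the word "fi" set as a single ligature glyph (id 0x01A3).
print(show_word("fi", [0x01A3]))
# /Span << /ActualText <FEFF00660069> >> BDC <01A3> Tj EMC
```

A text extractor that honours ActualText would then report "fi" for that span, regardless of which glyph the font used to draw it.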
JK