2014/1/1 Ross Moore <ross.mo...@mq.edu.au>:
> Hi Zdenek, and others,
>
> On 01/01/2014, at 11:53, Zdenek Wagner <zdenek.wag...@gmail.com> wrote:
>
> > > The attached file (produced using pdfTeX, not XeTeX) is an example
> > > that I've used in TUG talks, and elsewhere.
> > > Try copy/paste of portions of the mathematics. Be aware that you can
> > > get different results depending upon the PDF viewer used when
> > > extracting the text. (The file has uncompressed streams, so you
> > > can view it in a decent text editor to see the tagging structures
> > > used within the PDF content.)
> >
> > If I remember it well, ActualString supports only bytes, not
> > codepoints. Thus accented characters cannot be encoded, nor can Indic
> > characters.
>
> I don't know what you mean by this.
> In my testing I can tag pretty much any piece of content, and map it to any
> string using /ActualText.
> Mostly I use Adobe's Acrobat Pro as the PDF reader, and this works fine with
> it, modulo some bugs that have been reported when using very long
> replacement strings.
>
> In the example PDF that I attached to my previous message, each mathematical
> character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1
> alphanumerics expressed using surrogate pairs.

Thank you, now I see it. The book where I read about /ActualText did not
mention that I can use UTF-16 if I start the string with a BOM. Can I see
the source of the PDF? It would help me a lot to see how you do all these
things.
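To make the encoding concrete, here is a minimal sketch of how such a replacement string could be built. It is not Ross's Perl code; the function names `actual_text_hex` and `tag_span` are hypothetical, and `(glyphs) Tj` is only a placeholder for whatever operators actually draw the glyphs. It assumes the /ActualText value is written as a PDF hex string that begins with the BOM FEFF, followed by big-endian UTF-16.

```python
def actual_text_hex(text: str) -> str:
    """Return the hex-string body for /ActualText: BOM (FEFF) + UTF-16BE."""
    return "FEFF" + text.encode("utf-16-be").hex().upper()


def tag_span(content_ops: str, replacement: str) -> str:
    """Wrap content-stream operators in a /Span marked-content sequence.

    content_ops is a placeholder for the real drawing operators.
    """
    hex_body = actual_text_hex(replacement)
    return f"/Span <</ActualText <{hex_body}>>> BDC\n{content_ops}\nEMC"


# A Plane-1 character (U+1D465, mathematical italic small x) is
# encoded as a surrogate pair.
print(actual_text_hex("\U0001D465"))    # FEFFD835DC65

# Devanagari "ki": the logical order is consonant U+0915 followed by
# matra U+093F, even though the matra glyph is drawn first.
print(actual_text_hex("\u0915\u093F"))  # FEFF0915093F

print(tag_span("(glyphs) Tj", "\u0915\u093F"))
```

The point of the second example is that the replacement string carries the logical Unicode order for the whole tagged block, independent of the order in which the glyphs are drawn.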
> I see no reason why Indic character strings could not be done similarly.
> You probably need some on-the-fly preprocessing to work out the required
> strings to use.
> This is certainly possible, and is what I do with mathematical expressions.
> It should be possible to do it entirely within TeX, but the programming can
> get very tricky, so I use Perl instead.
>
> > ToUnicode supports one byte to many bytes, not many bytes
> > to many bytes.
>
> Exactly. This is why /ActualText is the structure to use.
>
> > Indic scripts use reordering where a matra precedes the
> > consonants, and some scripts contain two-piece matras. Unless the
> > specification was corrected, the ToUnicode map is unable to handle the
> > Indic scripts properly.
>
> Agreed; /ToUnicode is not what is needed here.
> This sounds like precisely the kind of situation where you want to tag an
> extended block of content and use /ActualText to map it to a
> pre-constructed Unicode string.
> I'm no expert in Indic languages, so cannot provide specific details or
> examples.
>
> > > > --
> > > > Regards,
> > > > Alexey Kryukov <anagnost at yandex dot ru>
> > > >
> > > > Moscow State University
> > > > Faculty of History
> > >
> > > Hope this helps,
> > >
> > > Ross
>
> Happy New Year,
>
> Ross

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz

--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
http://tug.org/mailman/listinfo/xetex