2011/11/19 Ross Moore <ross.mo...@mq.edu.au>:
> Hi Zdenek,
>
> On 19/11/2011, at 10:30 AM, Zdenek Wagner wrote:
>
>>> /ActualText is your friend here.
>>> You tag the content and provide the string that you want to appear
>>> with Copy/Paste as the value associated to a dictionary key.
>>>
>> I do not know whether the PDF specification has evolved since I read
>> it the last time. /ActualText allows only single-byte characters, ie
>> those with codes between 0 and 255, not arbitrary Unicode characters.
>
> That is most certainly not true.
> You code up UTF-16BE as Hex strings.
>
> Here is a snippet of the (tagged-pdfLaTeX) source coding from
> the main example that I showed in my TUG2011 talk.
> The URL for the video of the talk is given in several of my previous emails:
>
Thank you for the sample. I will try again when I have more time.
Maybe there is a stupid bug in my old code. As a matter of fact, when
playing with /ActualText I knew much less than now.

>>>> \SMC attr{/ActualText<FEFFD835DC4F>\TPDFaloud{1D44F}} noendtext 254
>>>> {mi}%
>>>> b%
>>>> _{\noEMC%
>>>> \TPDFsub
>>>> \SMC attr{/ActualText<FEFFD835DC58>\TPDFaloud{1D458}} noendtext 255
>>>> {mi}%
>>>> k%
>>>> \EMC
>>>> }^{\EMC
>>>> \SMC attr{/ActualText( )} noendtext 256 {Span}%
>>>> \pdffakespace
>>>> \EMC
>>>> }%
>>>> \TPDFpopbrack
>>>> \SMC attr{/ActualText<FEFF0029>\TPDFaloud{0029}} noendtext 257 {mo}%
>>>> \Bigr)%
>
>
> Inside the resulting PDF, this content looks like:
>
>>>> 1 0 0 1 4.902 2.463 cm
>>>> /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt( , b , )
>>>> >>BDC
>>>> BT
>>>> /F11 9.9626 Tf
>>>> [(b)]TJ
>>>> ET
>>>> EMC
>>>> 1 0 0 1 4.276 4.114 cm
>>>> /Span <</MCID 11 /ActualText( )
>>>> >>BDC
>>>> BT
>>>> /F103 1 Tf
>>>> [( )]TJ
>>>> ET
>>>> EMC
>>>> 1 0 0 1 0 -6.577 cm
>>>> /mi <</MCID 12 /ActualText<FEFFD835DC58>/Alt( sub k , )
>>>> >>BDC
>>>> BT
>>>> /F10 6.9738 Tf
>>>> [(k)]TJ
>>>> ET
>>>> EMC
>>>> 1 0 0 1 4.901 2.463 cm
>>>> /mo <</MCID 13 /Alt( close bracket:, , )
>>>> >>BDC
>
>
> The full PDF passes all of Adobe's validation tests for
> correct PDF syntax, Accessible Content, PDF/A-1b compliance.
>
> More particularly:
>
> /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt( , b , )
> >>BDC
> BT
> /F11 9.9626 Tf
> [(b)]TJ
> ET
> EMC
>
> expresses a math-italic 'b' as :
>
> 1. the glyph in the position of letter 'b' (in CMMI10 font);
>
> 2. to be spoken aloud as " , b , " where commas indicate a slight pause
>
> 3. to Copy/Paste as the surrogate pair Ux0D835 Ux0DC4F
> equivalent to a Plane-1 math-italic character 'b' .
>
> The /MCID key is necessary for tagged PDF, but the /Alt and /ActualText
> should work independently to full tagging.
> The '/mi' is immaterial; it could equally well be '/Span'.
>
>
>> /ActualText is demonstrated on German hyphenated words such as Zucker
>> which is hyphenated as Zuk- ker. I have tried to put /ActualText
>> manually via a special, I could see it in the PDF file but it did not
>> work.
>
> Yes, because it is quite important to position the tagging pieces
> correctly within the PDF content stream. It has to balance correctly
> with BT ... ET and the BDC ... EMC operator pairs, and there may
> be other subtle requirements.
>
> Certainly it cannot be done with just a single \special .
> There needs to be stuff both before and after the content
> that causes actual glyphs to be displayed.
>
>
> Just using \pdfliteral is not sufficient with pdfTeX; we needed
> a special modification that allowed the /mi <<...>>BDC
> and EMC to fit snugly around the BT ... ET .
>
> There could be a similar problem with XeTeX's
> \special{pdf:literal ... }
> (or whatever is the syntax).
> This is the issue that I was trying to discuss with JK in 2009 or 2010.
>
>
>>
>> When converting a white space to a space character some [complex]
>> heuristics are needed, while proper conversion of glyphs to characters
>> of Indic scripts requires just a few strict rules. Ligatures such as TRA
>> have to appear in the toUnicode map, otherwise their meaning will be
>> unclear. If you see the I-matra, go to the last consonant in the
>> sequence and put the I-matra character there. If you see the RA glyph
>> at the right edge of a syllable, go back to the leftmost consonant in
>> the group and prepend RA+VIRAMA there. This is all that has to be done
>> with Devanagari. Other Indic scripts contain two-part vowels but the
>> rules will be similarly simple. We should not be forced to double the
>> size of the PDF file. AR and other PDF rendering programs should learn
>> these simple rules and use them when extracting text.
>
> If you can provide the UTF-16BE Hex representation of these,
> I can create a PDF using it as the /ActualText replacement for
> some arbitrary string of letters.
>
> This will test whether this is a viable approach for Devanagari.
> If so, then it is a matter of working out how to expand this
> for a full solution.
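
For reference, here is a minimal sketch of how such UTF-16BE Hex
strings could be generated. It is written in Python and is not part of
either macro package discussed in this thread; the function name
actualtext_hex is made up for illustration. It assumes only what the
examples above show: the value is a PDF hex string that starts with the
byte-order mark FEFF, followed by the UTF-16BE code units, so that
characters outside the BMP come out as surrogate pairs.

```python
def actualtext_hex(text: str) -> str:
    """Return text as a PDF hex string: <FEFF...> with UTF-16BE code units.

    Hypothetical helper for illustration only.
    """
    # The "utf-16-be" codec writes no byte-order mark, so FEFF is added
    # by hand. Code points above U+FFFF are encoded as surrogate pairs.
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


if __name__ == "__main__":
    # Math-italic b (U+1D44F), as in the /mi example quoted above.
    print(actualtext_hex("\U0001D44F"))
    # Devanagari TA + VIRAMA + RA, the character sequence behind the TRA ligature.
    print(actualtext_hex("\u0924\u094D\u0930"))
```

The first call should give <FEFFD835DC4F>, matching the /ActualText
value in the quoted content stream, and the second <FEFF0924094D0930>.
Whether a given PDF reader honours such a value for Devanagari on
Copy/Paste is exactly what the proposed test would have to show.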
>
>
>>
>>> There is a macro package that can do this with pdfTeX, and it is
>>> a vital part of my Tagged PDF work for mathematics.
>>> Also, I have an example where the CJK.sty package is extended
>>> to tag Chinese characters built from multiple glyphs so that
>>> Copy/Paste works correctly (modulo PDF reader quirks).
>>>
>>> Not sure about XeTeX.
>>>
>>> I once tried to talk with Jonathan Kew about what would be needed
>>> to implement this properly, but he got totally the wrong idea
>>> concerning glyphs and characters, and what was needed to be done
>>> internally and what by macros. The conversation went nowhere.
>
>> --
>> Zdeněk Wagner
>
>
> Cheers,
>
> Ross
>
> ------------------------------------------------------------------------
> Ross Moore ross.mo...@mq.edu.au
> Mathematics Department office: E7A-419
> Macquarie University tel: +61 (0)2 9850 8955
> Sydney, Australia 2109 fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
> http://tug.org/mailman/listinfo/xetex
>

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz