Re: [XeTeX] turn off special characters in PDF
Hi Joe,

On 04/01/2014, at 8:43 AM, Joe Corneli wrote:

> Hi All:
>
> I'm glad my message sparked some discussion. My M[N]WE for my specific use case on tex.stackexchange.com has not gotten much attention - I recently attached a +200 bounty.
>
> http://tex.stackexchange.com/questions/151835/actualtext-in-small-cap-hyperlinks
>
> I figured I should put in a plug for that here. I already got a reply from one of the main authors of hyperref, but patching \href at the necessary level is beyond me. Finally, I realize a detailed discussion of this issue is probably not germane to this list, so if you feel that way, please direct further comments there, or to me off list.

No, it is quite germane to this list, and relates to a very recent thread.

The attached PDF is a variant of your example. Copy/Paste the text using Adobe Reader or Acrobat Pro. You should get:

  Old: Sexy tex: .
  New: Sexy tex: sxe .

Apple's Preview (at least within TeXShop) doesn't seem to recognise the /ActualText tagging.

Attachment: accsupp-href.pdf (Adobe PDF document)

To achieve this I had to do several things. Here are the relevant definitions:

  \newcommand*{\hrefnew}[2]{%
    \hrefold{#1}{\BeginAccSupp{method=pdfstringdef,unicode,ActualText={#2}}#2\EndAccSupp{}}}
  \AtBeginDocument{%
    \let\hrefold\href
    \let\href\hrefnew
  }

Notes:

1. Use \BeginAccSupp and \EndAccSupp as tightly as possible around the text needing to be tagged.

2. You want the method=pdfstringdef option. (It is pdfstringdef, not pdfstring.) This results in appropriate strings for the /ActualText value: either ASCII if possible (as here) or UTF-16 strings with a BOM.

3. Delay the rebinding of \href to \AtBeginDocument. This way you do not interfere with any other package making its own redefinition of what \href does.

What follows is highly technical and of no real concern to anyone just wanting to use /ActualText tagging. Rather, it is about implementing this (and more general kinds of) tagging in the most efficient way.

The result of the above coding is to adjust the PDF page stream to include:

  q 1 0 0 1 129.04 -82.56 cm
  /Span << ... >> BDC
  Q
  BT
  /F1 11.955 Tf 129.04 -82.56 Td[<095e09630950>]TJ
  ET
  q 1 0 0 1 145.89 -82.56 cm
  EMC
  Q

where you can see the /Span tagging of the content between BDC and EMC. This works, but is excessive, to my mind, by duplicating some operations.

Now the xdvipdfmx processor allows an alternative form for the \special used to place the tagging. It can be invoked with the following redefinition of internals from the accsupp.sty package:

  \makeatletter
  \def\ACCSUPP@bdc{\special{pdf:literal direct \ACCSUPP@span BDC}}
  \def\ACCSUPP@emc{\special{pdf:literal direct EMC}}
  \makeatother

This gives a much more efficient PDF stream:

  ...>6<0059001b>]TJ
  ET
  /Span << ... >> BDC
  BT
  /F1 11.955 Tf 129.04 -82.56 Td[<095e09630950>]TJ
  ET
  EMC
  BT
  /F1 11.955 Tf ...

in which the irrelevant coordinate/matrix changes (using 'cm') no longer occur. But even this could possibly be improved further, to avoid the extra BT ... ET:

  ...>6<0059001b>]TJ
  /Span << ... >> BDC
  /F1 11.955 Tf 129.04 -82.56 Td[<095e09630950>]TJ
  EMC
  /F1 11.955 Tf ...

In the experimental version of pdfTeX there is a keyword 'noendtext' that can be used with the new \pdfstartmarkedcontent primitive:

  \pdfstartmarkedcontent attr{} noendtext ...

which is designed with this aim in mind. Use of this keyword sets a flag so that the matching \pdfendmarkedcontent can keep the BT/ET nesting consistent.

> Thank you!
> Joe

Hope this helps,

	Ross


Ross Moore                        ross.mo...@mq.edu.au
Mathematics Department            office: E7A-206
Macquarie University              tel: +61 (0)2 9850 8955
Sydney, Australia 2109            fax: +61 (0)2 9850 8114
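For anyone wanting to try this, the pieces above assemble into a minimal complete document along the following lines. This is only a sketch: the URL, the link text and the small-caps wrapper are placeholders, not part of the original example; it assumes the standard hyperref and accsupp packages.

  \documentclass{article}
  \usepackage{hyperref}  % provides \href
  \usepackage{accsupp}   % provides \BeginAccSupp / \EndAccSupp

  % Tag the link text with /ActualText, so that copy/paste recovers
  % the logical characters even when small-caps shaping is applied.
  \newcommand*{\hrefnew}[2]{%
    \hrefold{#1}{\BeginAccSupp{method=pdfstringdef,unicode,ActualText={#2}}%
      #2\EndAccSupp{}}}

  % Rebind \href only at \begin{document}, after other packages have
  % finished making their own redefinitions.
  \AtBeginDocument{%
    \let\hrefold\href
    \let\href\hrefnew
  }

  \begin{document}
  Sexy tex: \textsc{\href{http://example.org}{sxe}}.
  \end{document}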
Re: [XeTeX] turn off special characters in PDF
Hi All:

I'm glad my message sparked some discussion. My M[N]WE for my specific use case on tex.stackexchange.com has not gotten much attention - I recently attached a +200 bounty.

http://tex.stackexchange.com/questions/151835/actualtext-in-small-cap-hyperlinks

I figured I should put in a plug for that here. I already got a reply from one of the main authors of hyperref, but patching \href at the necessary level is beyond me. Finally, I realize a detailed discussion of this issue is probably not germane to this list, so if you feel that way, please direct further comments there, or to me off list.

Thank you!
Joe

On Wed, Jan 1, 2014 at 10:34 PM, Zdenek Wagner wrote:
> 2014/1/1 Ross Moore :
>> Hi Zdeněk,
>>
>> On 02/01/2014, at 2:14 AM, Zdenek Wagner wrote:
>>
>>> 2014/1/1 Ross Moore :
>>>> In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.
>>>
>>> Thank you, now I see it. The book where I read about /ActualText did not mention that I can use UTF-16 if I start the string with a BOM.
>>
>> Fair enough; this I had to discover for myself. The PDF Reference Manual (e.g. for ISO 32000) has no such examples, so I had to experiment with different ways to specify strings requiring non-ASCII characters. UTF-16 is the most elegant, and avoids the messiness of using escape characters and octal codes, even for some non-letter ASCII characters.
>>
>>> Can I see the source of the PDF? It could help me much to see how you do all these things.
>>
>> Each piece of mathematics is captured, saved to a file, converted to MathML, then run through my Perl script to create alternative (La)TeX source. This is done to be able to create a fully-tagged PDF description of the mathematical content, using a special version of pdftex that Han The Thanh created for me (and others) --- still in an experimental stage.
>>
>> You should not need all of this machinery, but I'm happy to answer any questions you may have.
>>
>> I've attached a couple of examples of the output from my Perl script, in which you can see how the /ActualText replacement strings are specified, using a macro \SMC -- which ultimately expands to use the \pdfstartmarkedcontent primitive.
>
> Thank you.
>
>> Without the special primitives, you should be able to use \pdfliteral to insert the tagging needed for just using /ActualText. I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use.
>>
>> I'm not sure whether there is a LaTeX package that allows you to get the literal bits into the correct place without upsetting other fine details of the typesetting with Indic characters. This certainly should be possible, at least when using pdfLaTeX. Not sure of the details using XeTeX -- but you work with the source code, so can devise anything that is needed, right?
>
> Typesetting depends on HarfBuzz and font features, no package is needed (fontspec and polyglossia just save work that could be done by primitives); any code can be sent to xdvipdfmx by \special{pdf: code ...}, similarly as by \pdfliteral in pdftex. I already know how to do it.
Re: [XeTeX] turn off special characters in PDF
2014/1/1 Ross Moore :
> Hi Zdeněk,
>
> On 02/01/2014, at 2:14 AM, Zdenek Wagner wrote:
>
>> 2014/1/1 Ross Moore :
>>> In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.
>>
>> Thank you, now I see it. The book where I read about /ActualText did not mention that I can use UTF-16 if I start the string with a BOM.
>
> Fair enough; this I had to discover for myself. The PDF Reference Manual (e.g. for ISO 32000) has no such examples, so I had to experiment with different ways to specify strings requiring non-ASCII characters. UTF-16 is the most elegant, and avoids the messiness of using escape characters and octal codes, even for some non-letter ASCII characters.
>
>> Can I see the source of the PDF? It could help me much to see how you do all these things.
>
> Each piece of mathematics is captured, saved to a file, converted to MathML, then run through my Perl script to create alternative (La)TeX source. This is done to be able to create a fully-tagged PDF description of the mathematical content, using a special version of pdftex that Han The Thanh created for me (and others) --- still in an experimental stage.
>
> You should not need all of this machinery, but I'm happy to answer any questions you may have.
>
> I've attached a couple of examples of the output from my Perl script, in which you can see how the /ActualText replacement strings are specified, using a macro \SMC -- which ultimately expands to use the \pdfstartmarkedcontent primitive.

Thank you.

> Without the special primitives, you should be able to use \pdfliteral to insert the tagging needed for just using /ActualText.
>
>>> I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use.
>
> I'm not sure whether there is a LaTeX package that allows you to get the literal bits into the correct place without upsetting other fine details of the typesetting with Indic characters. This certainly should be possible, at least when using pdfLaTeX. Not sure of the details using XeTeX -- but you work with the source code, so can devise anything that is needed, right?

Typesetting depends on HarfBuzz and font features, no package is needed (fontspec and polyglossia just save work that could be done by primitives); any code can be sent to xdvipdfmx by \special{pdf: code ...}, similarly as by \pdfliteral in pdftex. I already know how to do it.

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
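For XeTeX users wanting to follow this up before any dedicated primitives exist, a raw \special-based wrapper might look like the sketch below. The macro name and its use around \textsc text are illustrative assumptions, and the plain (...) string form only suits ASCII replacement text; non-ASCII replacements need a hex string with a BOM, as discussed elsewhere in this thread.

  % Sketch: raw /ActualText tagging under XeTeX + xdvipdfmx.
  % #1 = replacement string (ASCII here), #2 = typeset content.
  \def\TagActualText#1#2{%
    \special{pdf:literal direct /Span << /ActualText (#1) >> BDC}%
    #2%
    \special{pdf:literal direct EMC}}

  % Usage: extraction should yield "something", whatever glyphs appear.
  \TagActualText{something}{\textsc{something}}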
Re: [XeTeX] turn off special characters in PDF
Hi Zdeněk,

On 02/01/2014, at 2:14 AM, Zdenek Wagner wrote:

> 2014/1/1 Ross Moore :
>> In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.
>
> Thank you, now I see it. The book where I read about /ActualText did not mention that I can use UTF-16 if I start the string with a BOM.

Fair enough; this I had to discover for myself. The PDF Reference Manual (e.g. for ISO 32000) has no such examples, so I had to experiment with different ways to specify strings requiring non-ASCII characters. UTF-16 is the most elegant, and avoids the messiness of using escape characters and octal codes, even for some non-letter ASCII characters.

> Can I see the source of the PDF? It could help me much to see how you do all these things.

Each piece of mathematics is captured, saved to a file, converted to MathML, then run through my Perl script to create alternative (La)TeX source. This is done to be able to create a fully-tagged PDF description of the mathematical content, using a special version of pdftex that Han The Thanh created for me (and others) --- still in an experimental stage.

You should not need all of this machinery, but I'm happy to answer any questions you may have.

I've attached a couple of examples of the output from my Perl script, in which you can see how the /ActualText replacement strings are specified, using a macro \SMC -- which ultimately expands to use the \pdfstartmarkedcontent primitive.

Attachment: 2013-Assign2-soln-inline-2-tags.tex (binary data)
Attachment: 2013-Assign2-soln-inline-1-tags.tex (binary data)

Without the special primitives, you should be able to use \pdfliteral to insert the tagging needed for just using /ActualText.

>> I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use.

I'm not sure whether there is a LaTeX package that allows you to get the literal bits into the correct place without upsetting other fine details of the typesetting with Indic characters. This certainly should be possible, at least when using pdfLaTeX. Not sure of the details using XeTeX -- but you work with the source code, so can devise anything that is needed, right?

Hope this helps,

	Ross


Ross Moore                        ross.mo...@mq.edu.au
Mathematics Department            office: E7A-206
Macquarie University              tel: +61 (0)2 9850 8955
Sydney, Australia 2109            fax: +61 (0)2 9850 8114
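To make the surrogate-pair point concrete: under pdfTeX, without the experimental primitives, tagging a single math letter so that it extracts as a Plane-1 character might look like the following sketch. The choice of U+1D465 (MATHEMATICAL ITALIC SMALL X) is illustrative; its UTF-16BE form with a leading BOM is FEFF D835 DC65, where D835/DC65 is the surrogate pair.

  % Sketch (pdfTeX): make an italic x extract as U+1D465,
  % written as a UTF-16BE hex string with a leading BOM.
  \pdfliteral direct {/Span << /ActualText <FEFFD835DC65> >> BDC}%
  $x$%
  \pdfliteral direct {EMC}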
Re: [XeTeX] turn off special characters in PDF
2014/1/1 Ross Moore :
> Hi Zdenek, and others,
>
> On 01/01/2014, at 11:53, Zdenek Wagner wrote:
>
>>> The attached file (produced using pdfTeX, not XeTeX) is an example that I've used in TUG talks, and elsewhere. Try copy/paste of portions of the mathematics. Be aware that you can get different results depending upon the PDF viewer used when extracting the text. (The file has uncompressed streams, so you can view it in a decent text editor to see the tagging structures used within the PDF content.)
>>
>> If I remember it well, /ActualText supports only bytes, not codepoints. Thus accented characters cannot be encoded, neither Indic characters.
>
> I don't know what you mean by this. In my testing I can tag pretty-much any piece of content, and map it to any string using /ActualText. Mostly I use Adobe's Acrobat Pro as the PDF reader, and this works fine with it, modulo some bugs that have been reported when using very long replacement strings.
>
> In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.

Thank you, now I see it. The book where I read about /ActualText did not mention that I can use UTF-16 if I start the string with a BOM. Can I see the source of the PDF? It could help me much to see how you do all these things.

> I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use. This is certainly possible, and is what I do with mathematical expressions. It should be possible to do it entirely within TeX, but the programming can get very tricky, so I use Perl instead.
>
>> ToUnicode supports one byte to many bytes, not many bytes to many bytes.
>
> Exactly. This is why /ActualText is the structure to use.
>
>> Indic scripts use reordering where a matra precedes the consonants, or some scripts contain two-piece matras. Unless the specification was corrected, the ToUnicode map is unable to handle the Indic scripts properly.
>
> Agreed; /ToUnicode is not what is needed here. This sounds like precisely the kind of situation where you want to tag an extended block of content and use /ActualText to map it to a pre-constructed Unicode string. I'm no expert in Indic languages, so cannot provide specific details or examples.

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] turn off special characters in PDF
On 1/1/14 11:49, Khaled Hosny wrote:

> The situation in XeTeX is more complex because the typesetting (where the original text string is known) is done in XeTeX, while the PDF generation is done by the PDF driver, and the communication channel between both (XDV files) passes only glyph IDs, not the original text strings.

I'd suggest that the best way forward here would be to modify xetex such that it includes the original Unicode text in the xdv stream, as well as the positioned glyphs. Then the driver can write a correct ActualText for each word.

There'd be some performance cost to this, of course; the inclusion of the Unicode text could be an optional feature, so that people who just want a "throwaway" PDF in order to print a document don't have to suffer slower generation and/or larger files.

This wouldn't address all the problems with PDF text extraction; higher-level issues of text structure and flow would still be tricky in the case of documents with any complex layout. But at least the basic Unicode characters making up each word would be reliably correct.

JK
Re: [XeTeX] turn off special characters in PDF
On Wed, Jan 01, 2014 at 10:07:54PM +1100, Ross Moore wrote:

>> ToUnicode supports one byte to many bytes, not many bytes to many bytes.
>
> Exactly. This is why /ActualText is the structure to use.

My only issue with /ActualText is that using it to tag whole words breaks fine text selection (one cannot select individual characters inside these words, and searching for one character will highlight the whole word containing it). Otherwise it is the most versatile mechanism for preserving the original text in PDF files.

Because of that, I think a better strategy is to use the /ToUnicode mapping whenever applicable, and resort to /ActualText for the problematic cases, namely one-to-many substitutions, reordering, and different substitutions leading to the same glyph (though the last one can be handled by duplicating the glyph under a different name/encoding when subsetting the font).

The situation in XeTeX is more complex because the typesetting (where the original text string is known) is done in XeTeX, while the PDF generation is done by the PDF driver, and the communication channel between both (XDV files) passes only glyph IDs, not the original text strings; so we can only rely on font encodings and glyph names (or try to guess glyph names by examining simple font substitutions, as in the upcoming patch).

Regards,
Khaled
Re: [XeTeX] turn off special characters in PDF
Hi Zdenek, and others,

On 01/01/2014, at 11:53, Zdenek Wagner wrote:

>> The attached file (produced using pdfTeX, not XeTeX) is an example that I've used in TUG talks, and elsewhere. Try copy/paste of portions of the mathematics. Be aware that you can get different results depending upon the PDF viewer used when extracting the text. (The file has uncompressed streams, so you can view it in a decent text editor to see the tagging structures used within the PDF content.)
>
> If I remember it well, /ActualText supports only bytes, not codepoints. Thus accented characters cannot be encoded, neither Indic characters.

I don't know what you mean by this. In my testing I can tag pretty-much any piece of content, and map it to any string using /ActualText. Mostly I use Adobe's Acrobat Pro as the PDF reader, and this works fine with it, modulo some bugs that have been reported when using very long replacement strings.

In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.

I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use. This is certainly possible, and is what I do with mathematical expressions. It should be possible to do it entirely within TeX, but the programming can get very tricky, so I use Perl instead.

> ToUnicode supports one byte to many bytes, not many bytes to many bytes.

Exactly. This is why /ActualText is the structure to use.

> Indic scripts use reordering where a matra precedes the consonants, or some scripts contain two-piece matras. Unless the specification was corrected, the ToUnicode map is unable to handle the Indic scripts properly.

Agreed; /ToUnicode is not what is needed here. This sounds like precisely the kind of situation where you want to tag an extended block of content and use /ActualText to map it to a pre-constructed Unicode string. I'm no expert in Indic languages, so cannot provide specific details or examples.

Happy New Year,

	Ross
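As an illustration of this block-tagging idea (not from the original thread): with the accsupp package, a reordered Devanagari syllable such as कि, where the i-matra is drawn before the consonant but the logical order is KA (U+0915) then VOWEL SIGN I (U+093F), could be tagged along these lines. The syllable is an assumed example; method=hex with the unicode option makes accsupp emit the hex string with a BOM, i.e. <FEFF0915093F>.

  % Sketch: tag a reordered syllable so extraction yields logical order.
  \BeginAccSupp{method=hex,unicode,ActualText=0915093F}%
  कि% glyphs may be visually reordered by the shaper
  \EndAccSupp{}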
Re: [XeTeX] turn off special characters in PDF
2014/1/1 Ross Moore :
> Hi Alex,
>
> On 31/12/2013, at 5:20 PM, Alexey Kryukov wrote:
>
>> On Mon, 30 Dec 2013 10:45:39 +1100 Ross Moore wrote:
>>
>>> I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.
>>
>> Well, the /ActualText approach looks like an overcomplication to me. I think it is intended for very special cases, like treating the 'ck' cluster in the old German hyphenation rules. For typical ligatures it is sufficient to produce a ToUnicode CMap entry mapping the ligature to its source characters. That's what xetex (actually xdvipdfmx) actually does... unless, as Khaled has correctly specified, the font maps its substitution glyphs to PUA or has no glyph names.
>
> Sure. But if you use such fonts for which the CMap is limited in this way, then /ActualText is your best friend.
>
>> And I don't fully understand your remark regarding legacy fonts that may not contain a /ToUnicode resource, since it's up to the PDF generation software (xdvipdfmx in our case) to produce such a resource.
>
> 1. Any time a font character is used in 2 or more different ways, corresponding to different Unicode points, you will face such issues. In legacy (e.g. pre-Unicode) fonts this is not uncommon.
>
> For example, in the original CM fonts, the same font character was used for both the dot-under and dot-above accents, using macros to put the accent within a box and position it either above or below the letter being accented. The CMap file can only specify a single value for this character. What should be the Unicode value? Should it be within the "Combining Character" range?
>
> But it is worse than this: for dot-above, the accent appears within the PDF *before* the letter being accented, while for the dot-under it comes afterwards. Thus combining characters will not work, but can result in the wrong letter being accented. Using an /ActualText is the only reasonable way to cope with this --- apart from switching fonts, of course.
>
> 2. Another example is the ellipsis '…' for which people often just use '...' in the source. One can use /ActualText to map this combination to the correct Unicode character.
>
> 3. Greek capitals, which look the same as Latin letters, are another example.
>
> 4. There are plenty more examples coming from mathematics; especially if you want variable names to copy/paste as Plane-1 alphanumerics.
>
> The attached file (produced using pdfTeX, not XeTeX) is an example that I've used in TUG talks, and elsewhere. Try copy/paste of portions of the mathematics. Be aware that you can get different results depending upon the PDF viewer used when extracting the text. (The file has uncompressed streams, so you can view it in a decent text editor to see the tagging structures used within the PDF content.)

If I remember it well, /ActualText supports only bytes, not codepoints. Thus accented characters cannot be encoded, neither Indic characters. ToUnicode supports one byte to many bytes, not many bytes to many bytes. Indic scripts use reordering where a matra precedes the consonants, or some scripts contain two-piece matras. Unless the specification was corrected, the ToUnicode map is unable to handle the Indic scripts properly.

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
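Example 2 in the quoted list translates directly into accsupp terms. A minimal sketch (assuming the accsupp package) that makes three typed periods extract as a single U+2026 HORIZONTAL ELLIPSIS:

  % Sketch: three source periods, extracted as one U+2026.
  \BeginAccSupp{method=hex,unicode,ActualText=2026}...\EndAccSupp{}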
Re: [XeTeX] turn off special characters in PDF
On Mon, 30 Dec 2013 10:45:39 +1100 Ross Moore wrote:

> I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.

Well, the /ActualText approach looks like an overcomplication to me. I think it is intended for very special cases, like treating the 'ck' cluster in the old German hyphenation rules. For typical ligatures it is sufficient to produce a ToUnicode CMap entry mapping the ligature to its source characters. That's what xetex (actually xdvipdfmx) actually does... unless, as Khaled has correctly specified, the font maps its substitution glyphs to PUA or has no glyph names.

And I don't fully understand your remark regarding legacy fonts that may not contain a /ToUnicode resource, since it's up to the PDF generation software (xdvipdfmx in our case) to produce such a resource.

-- 
Regards,
Alexey Kryukov

Moscow State University
Faculty of History
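For readers unfamiliar with the mechanism described here: a /ToUnicode CMap maps glyph codes in the embedded font to Unicode strings, and a single entry may map one code to several codepoints. A fragment covering ligature glyphs might look like the following, in the same raw-PDF style as the stream excerpts earlier in this thread; the glyph codes 0x01AB and 0x01AC are made-up values for illustration.

  2 beginbfchar
  <01AB> <00660069>      % glyph 0x01AB -> "fi"  (U+0066 U+0069)
  <01AC> <006600660069>  % glyph 0x01AC -> "ffi" (U+0066 U+0066 U+0069)
  endbfchar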
Re: [XeTeX] turn off special characters in PDF
On Mon, Dec 30, 2013 at 06:03:14PM +0100, Zdenek Wagner wrote:

> 2013/12/30 Joe Corneli :
>> Thanks Ross.
>>
>> I think in this case all I really need is to revise the \href code to insert /ActualText (because I'm using small caps for hyperlinks in this doc). Pretty much everything else works fine already.
>
> Small caps have nothing to do with the code points; it is just the shape of the characters. If you enter \textsc{something}, copy&paste should result in lowercase "something".

See my reply; this does not always work, and it is a known limitation of XeTeX/(x)dvipdfmx, though the patch submitted by Alexey Kryukov recently should improve the situation for many fonts that fail today.

Regards,
Khaled
Re: [XeTeX] turn off special characters in PDF
2013/12/30 Joe Corneli :
> Thanks Ross.
>
> I think in this case all I really need is to revise the \href code to insert /ActualText (because I'm using small caps for hyperlinks in this doc). Pretty much everything else works fine already.

Small caps have nothing to do with the code points; it is just the shape of the characters. If you enter \textsc{something}, copy&paste should result in lowercase "something".

> Joe
>
> On Sun, Dec 29, 2013 at 11:45 PM, Ross Moore wrote:
>> Hi Joe,
>>
>> On 30/12/2013, at 8:12 AM, Joe Corneli wrote:
>>
>>> This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357
>>>
>>> Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?
>>
>> In short, no! -- because this is against the idea of making more use of Unicode, across all computing platforms.
>>
>> Certainly a ligature can have an /ActualText replacement consisting of the separate characters, but this requires the PDF producer to have supplied this within the PDF, as it is being generated.
>>
>> I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.
>>
>> Besides, do you really mean *all* special characters? What about simple symbols like ß∑∂√∫Ω and all the other myriad foreign/accented letters and mathematical symbols?
>>
>> If you want these to copy/paste as TeX coding (\beta \Sum \delta \sqrt etc.) within documents that you write yourself, then I wrote a package called mmap where this is an option for the original Computer Modern fonts.
>>
>> Alternatively, a PDF reader might supply a filtering mode that converts the ligatures back to separate characters. Then the user ought to be able to choose whether or not to use this filter. I don't know of any that actually do this. (In any case, you would want such a tool to allow you to specify which characters to replace, and which to preserve.)
>>
>> Your best option is surely to (get someone else to) write such a filter that meets your needs, and use it to post-process the text extracted via Copy/Paste or with other text-extraction tools.
>>
>> Of course this is no use if your aim is to create documents for which others get the desired result via Copy/Paste. For this, the /ActualText approach is what you need.
>>
>> Hope this helps,
>>
>> 	Ross

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] turn off special characters in PDF
Thanks Ross.

I think in this case all I really need is to revise the \href code to insert /ActualText (because I'm using small caps for hyperlinks in this doc). Pretty much everything else works fine already.

Joe

On Sun, Dec 29, 2013 at 11:45 PM, Ross Moore wrote:
> Hi Joe,
>
> On 30/12/2013, at 8:12 AM, Joe Corneli wrote:
>
>> This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357
>>
>> Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?
>
> In short, no! -- because this is against the idea of making more use of Unicode, across all computing platforms.
>
> Certainly a ligature can have an /ActualText replacement consisting of the separate characters, but this requires the PDF producer to have supplied this within the PDF, as it is being generated.
>
> I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.
>
> Besides, do you really mean *all* special characters? What about simple symbols like ß∑∂√∫Ω and all the other myriad foreign/accented letters and mathematical symbols?
>
> If you want these to copy/paste as TeX coding (\beta \Sum \delta \sqrt etc.) within documents that you write yourself, then I wrote a package called mmap where this is an option for the original Computer Modern fonts.
>
> Alternatively, a PDF reader might supply a filtering mode that converts the ligatures back to separate characters. Then the user ought to be able to choose whether or not to use this filter. I don't know of any that actually do this. (In any case, you would want such a tool to allow you to specify which characters to replace, and which to preserve.)
>
> Your best option is surely to (get someone else to) write such a filter that meets your needs, and use it to post-process the text extracted via Copy/Paste or with other text-extraction tools.
>
> Of course this is no use if your aim is to create documents for which others get the desired result via Copy/Paste. For this, the /ActualText approach is what you need.
>
> Hope this helps,
>
> 	Ross
Re: [XeTeX] turn off special characters in PDF
Hi Joe,

On 30/12/2013, at 8:12 AM, Joe Corneli wrote:

> This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357
>
> Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?

In short, no! -- because this is against the idea of making more use of Unicode, across all computing platforms.

Certainly a ligature can have an /ActualText replacement consisting of the separate characters, but this requires the PDF producer to have supplied this within the PDF, as it is being generated.

I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.

Besides, do you really mean *all* special characters? What about simple symbols like ß∑∂√∫Ω and all the other myriad foreign/accented letters and mathematical symbols?

If you want these to copy/paste as TeX coding (\beta \Sum \delta \sqrt etc.) within documents that you write yourself, then I wrote a package called mmap where this is an option for the original Computer Modern fonts.

Alternatively, a PDF reader might supply a filtering mode that converts the ligatures back to separate characters. Then the user ought to be able to choose whether or not to use this filter. I don't know of any that actually do this. (In any case, you would want such a tool to allow you to specify which characters to replace, and which to preserve.)

Your best option is surely to (get someone else to) write such a filter that meets your needs, and use it to post-process the text extracted via Copy/Paste or with other text-extraction tools.

Of course this is no use if your aim is to create documents for which others get the desired result via Copy/Paste. For this, the /ActualText approach is what you need.

Hope this helps,

	Ross


Ross Moore                        ross.mo...@mq.edu.au
Mathematics Department            office: E7A-206
Macquarie University              tel: +61 (0)2 9850 8955
Sydney, Australia 2109            fax: +61 (0)2 9850 8114
Re: [XeTeX] turn off special characters in PDF
On Sun, Dec 29, 2013 at 09:12:28PM +, Joe Corneli wrote:

> This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357
>
> Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?

It should just work (unless it is a MS font with no glyph names, or a font that uses the PUA for those alternate glyphs; in that case the only solution is to wait for the next TeX Live, where this will, hopefully, be handled).

Regards,
Khaled
[XeTeX] turn off special characters in PDF
This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357

Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?
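For completeness, the ligature half of this question (the part the linked answer addresses) is straightforward with fontspec; a sketch, with the font name as a placeholder. Small caps are a different matter, taken up in the replies above.

  % Sketch: disable common ligatures for the main font.
  \usepackage{fontspec}
  \setmainfont[Ligatures=NoCommon]{Latin Modern Roman}
  % or switch off the OpenType 'liga' feature directly:
  % \setmainfont[RawFeature={-liga}]{Latin Modern Roman}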