Re: [XeTeX] turn off special characters in PDF
Hi Joe,

On 04/01/2014, at 8:43 AM, Joe Corneli wrote:

> Hi All:
>
> I'm glad my message sparked some discussion. My M[N]WE for my specific use case on tex.stackexchange.com has not gotten much attention - I recently attached a +200 bounty.
>
> http://tex.stackexchange.com/questions/151835/actualtext-in-small-cap-hyperlinks
>
> I figured I should put in a plug for that here. I already got a reply from one of the main authors of hyperref, but patching \href at the necessary level is beyond me. Finally, I realize a detailed discussion of this issue is probably not germane to this list, so if you feel that way, please direct further comments there, or to me off list.

No, it is quite germane to this list, and relates to a very recent thread.

The attached PDF is a variant of your example. Copy/Paste the text using Adobe Reader or Acrobat Pro. You should get:

  Old: Sexy tex: .
  New: Sexy tex: sxe .

Apple's Preview (at least within TeXShop) doesn't seem to recognise the /ActualText tagging.

Attachment: accsupp-href.pdf (Adobe PDF document)

To achieve this I had to do several things. Here are the relevant definitions:

  \newcommand*{\hrefnew}[2]{%
    \hrefold{#1}{\BeginAccSupp{method=pdfstringdef,unicode,ActualText={#2}}#2\EndAccSupp{}}}
  \AtBeginDocument{%
    \let\hrefold\href
    \let\href\hrefnew
  }

Notes:

1. Use \BeginAccSupp and \EndAccSupp as tightly as possible around the text needing to be tagged.

2. You want the method=pdfstringdef option. (It is pdfstringdef, not pdfstring.) This results in appropriate strings for the /ActualText value: either ASCII if possible (as here) or UTF-16 strings with a BOM.

3. Delay the rebinding of \href to \AtBeginDocument. This way you do not interfere with any other package making its own redefinition of what \href does.

What follows is highly technical and of no real concern to anyone just wanting to use /ActualText tagging. Rather, it is about implementing this (and more general kinds of) tagging in the most efficient way.

The result of the above coding is to adjust the PDF page stream to include:

  q 1 0 0 1 129.04 -82.56 cm
  /Span << ... >> BDC
  Q
  BT
  /F1 11.955 Tf 129.04 -82.56 Td[<095e09630950>]TJ
  ET
  q 1 0 0 1 145.89 -82.56 cm
  EMC
  Q

where you can see the /Span tagging of the content between BDC and EMC. This works, but is excessive, to my mind, by duplicating some operations.

Now the xdvipdfmx processor allows an alternative form for the \special used to place the tagging. It can be invoked with the following redefinition of internals from the accsupp.sty package:

  \makeatletter
  \def\ACCSUPP@bdc{\special{pdf:literal direct \ACCSUPP@span BDC}}
  \def\ACCSUPP@emc{\special{pdf:literal direct EMC}}
  \makeatother

This gives a much more efficient PDF stream:

  ...>6<0059001b>]TJ
  ET
  /Span << ... >> BDC
  BT
  /F1 11.955 Tf 129.04 -82.56 Td[<095e09630950>]TJ
  ET
  EMC
  BT
  /F1 11.955 Tf ...

in which the irrelevant coordinate/matrix changes (using 'cm') no longer occur. But even this could possibly be improved further, to avoid the extra BT ... ET:

  ...>6<0059001b>]TJ
  /Span << ... >> BDC
  /F1 11.955 Tf 129.04 -82.56 Td[<095e09630950>]TJ
  EMC
  /F1 11.955 Tf ...

In the experimental version of pdfTeX there is a keyword 'noendtext' that can be used with the new \pdfstartmarkedcontent primitive:

  \pdfstartmarkedcontent attr{} noendtext ...

which is designed with this aim in mind. Use of this keyword sets a flag so that the matching \pdfendmarkedcontent can keep the BT/ET nesting consistent.

> Thank you!
> Joe

Hope this helps,

	Ross


Ross Moore                        ross.mo...@mq.edu.au
Mathematics Department            office: E7A-206
Macquarie University              tel: +61 (0)2 9850 8955
Sydney, Australia 2109            fax: +61 (0)2 9850 8114
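For anyone wanting to try this, the pieces above assemble into a minimal complete document along the following lines. This is only a sketch: the URL, the link text and the small-caps wrapper are placeholders, not part of the original example; it assumes the standard hyperref and accsupp packages.

  \documentclass{article}
  \usepackage{hyperref}  % provides \href
  \usepackage{accsupp}   % provides \BeginAccSupp / \EndAccSupp

  % Tag the link text with /ActualText, so that copy/paste recovers
  % the logical characters even when small-caps shaping is applied.
  \newcommand*{\hrefnew}[2]{%
    \hrefold{#1}{\BeginAccSupp{method=pdfstringdef,unicode,ActualText={#2}}%
      #2\EndAccSupp{}}}

  % Rebind \href only at \begin{document}, after other packages have
  % finished making their own redefinitions.
  \AtBeginDocument{%
    \let\hrefold\href
    \let\href\hrefnew
  }

  \begin{document}
  Sexy tex: \textsc{\href{http://example.org}{sxe}}.
  \end{document}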
Re: [XeTeX] turn off special characters in PDF
Hi All:

I'm glad my message sparked some discussion. My M[N]WE for my specific use case on tex.stackexchange.com has not gotten much attention - I recently attached a +200 bounty.

http://tex.stackexchange.com/questions/151835/actualtext-in-small-cap-hyperlinks

I figured I should put in a plug for that here. I already got a reply from one of the main authors of hyperref, but patching \href at the necessary level is beyond me. Finally, I realize a detailed discussion of this issue is probably not germane to this list, so if you feel that way, please direct further comments there, or to me off list.

Thank you!
Joe

On Wed, Jan 1, 2014 at 10:34 PM, Zdenek Wagner wrote:
> 2014/1/1 Ross Moore :
>> Hi Zdeněk,
>>
>> On 02/01/2014, at 2:14 AM, Zdenek Wagner wrote:
>>
>>> 2014/1/1 Ross Moore :
>>>> In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.
>>>
>>> Thank you, now I see it. The book where I read about /ActualText did not mention that I can use UTF-16 if I start the string with a BOM.
>>
>> Fair enough; this I had to discover for myself. The PDF Reference Manual (e.g. for ISO 32000) has no such examples, so I had to experiment with different ways to specify strings requiring non-ASCII characters. UTF-16 is the most elegant, and avoids the messiness of using escape characters and octal codes, even for some non-letter ASCII characters.
>>
>>> Can I see the source of the PDF? It could help me much to see how you do all these things.
>>
>> Each piece of mathematics is captured, saved to a file, converted to MathML, then run through my Perl script to create alternative (La)TeX source. This is done to be able to create a fully-tagged PDF description of the mathematical content, using a special version of pdftex that Han The Thanh created for me (and others) --- still in an experimental stage.
>>
>> You should not need all of this machinery, but I'm happy to answer any questions you may have.
>>
>> I've attached a couple of examples of the output from my Perl script, in which you can see how the /ActualText replacement strings are specified, using a macro \SMC -- which ultimately expands to use the \pdfstartmarkedcontent primitive.
>
> Thank you.
>
>> Without the special primitives, you should be able to use \pdfliteral to insert the tagging needed for just using /ActualText. I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use.
>>
>> I'm not sure whether there is a LaTeX package that allows you to get the literal bits into the correct place without upsetting other fine details of the typesetting with Indic characters. This certainly should be possible, at least when using pdfLaTeX. Not sure of the details using XeTeX -- but you work with the source code, so can devise anything that is needed, right?
>
> Typesetting depends on HarfBuzz and font features, no package is needed (fontspec and polyglossia just save work that could be done by primitives); any code can be sent to xdvipdfmx by \special{pdf: code ...}, similarly as by \pdfliteral in pdftex. I already know how to do it.
Re: [XeTeX] turn off special characters in PDF
2014/1/1 Ross Moore :
> Hi Zdeněk,
>
> On 02/01/2014, at 2:14 AM, Zdenek Wagner wrote:
>
>> 2014/1/1 Ross Moore :
>>> In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.
>>
>> Thank you, now I see it. The book where I read about /ActualText did not mention that I can use UTF-16 if I start the string with a BOM.
>
> Fair enough; this I had to discover for myself. The PDF Reference Manual (e.g. for ISO 32000) has no such examples, so I had to experiment with different ways to specify strings requiring non-ASCII characters. UTF-16 is the most elegant, and avoids the messiness of using escape characters and octal codes, even for some non-letter ASCII characters.
>
>> Can I see the source of the PDF? It could help me much to see how you do all these things.
>
> Each piece of mathematics is captured, saved to a file, converted to MathML, then run through my Perl script to create alternative (La)TeX source. This is done to be able to create a fully-tagged PDF description of the mathematical content, using a special version of pdftex that Han The Thanh created for me (and others) --- still in an experimental stage.
>
> You should not need all of this machinery, but I'm happy to answer any questions you may have.
>
> I've attached a couple of examples of the output from my Perl script, in which you can see how the /ActualText replacement strings are specified, using a macro \SMC -- which ultimately expands to use the \pdfstartmarkedcontent primitive.

Thank you.

> Without the special primitives, you should be able to use \pdfliteral to insert the tagging needed for just using /ActualText.
>
>>> I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use.
>
> I'm not sure whether there is a LaTeX package that allows you to get the literal bits into the correct place without upsetting other fine details of the typesetting with Indic characters. This certainly should be possible, at least when using pdfLaTeX. Not sure of the details using XeTeX -- but you work with the source code, so can devise anything that is needed, right?

Typesetting depends on HarfBuzz and font features, no package is needed (fontspec and polyglossia just save work that could be done by primitives); any code can be sent to xdvipdfmx by \special{pdf: code ...}, similarly as by \pdfliteral in pdftex. I already know how to do it.

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
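For XeTeX users wanting to follow this up before any dedicated primitives exist, a raw \special-based wrapper might look like the sketch below. The macro name and its use around \textsc text are illustrative assumptions, and the plain (...) string form only suits ASCII replacement text; non-ASCII replacements need a hex string with a BOM, as discussed elsewhere in this thread.

  % Sketch: raw /ActualText tagging under XeTeX + xdvipdfmx.
  % #1 = replacement string (ASCII here), #2 = typeset content.
  \def\TagActualText#1#2{%
    \special{pdf:literal direct /Span << /ActualText (#1) >> BDC}%
    #2%
    \special{pdf:literal direct EMC}}

  % Usage: extraction should yield "something", whatever glyphs appear.
  \TagActualText{something}{\textsc{something}}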
Re: [XeTeX] turn off special characters in PDF
Hi Zdeněk,

On 02/01/2014, at 2:14 AM, Zdenek Wagner wrote:

> 2014/1/1 Ross Moore :
>> In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.
>
> Thank you, now I see it. The book where I read about /ActualText did not mention that I can use UTF-16 if I start the string with a BOM.

Fair enough; this I had to discover for myself. The PDF Reference Manual (e.g. for ISO 32000) has no such examples, so I had to experiment with different ways to specify strings requiring non-ASCII characters. UTF-16 is the most elegant, and avoids the messiness of using escape characters and octal codes, even for some non-letter ASCII characters.

> Can I see the source of the PDF? It could help me much to see how you do all these things.

Each piece of mathematics is captured, saved to a file, converted to MathML, then run through my Perl script to create alternative (La)TeX source. This is done to be able to create a fully-tagged PDF description of the mathematical content, using a special version of pdftex that Han The Thanh created for me (and others) --- still in an experimental stage.

You should not need all of this machinery, but I'm happy to answer any questions you may have.

I've attached a couple of examples of the output from my Perl script, in which you can see how the /ActualText replacement strings are specified, using a macro \SMC -- which ultimately expands to use the \pdfstartmarkedcontent primitive.

Attachment: 2013-Assign2-soln-inline-2-tags.tex (binary data)
Attachment: 2013-Assign2-soln-inline-1-tags.tex (binary data)

Without the special primitives, you should be able to use \pdfliteral to insert the tagging needed for just using /ActualText.

>> I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use.

I'm not sure whether there is a LaTeX package that allows you to get the literal bits into the correct place without upsetting other fine details of the typesetting with Indic characters. This certainly should be possible, at least when using pdfLaTeX. Not sure of the details using XeTeX -- but you work with the source code, so can devise anything that is needed, right?

Hope this helps,

	Ross


Ross Moore                        ross.mo...@mq.edu.au
Mathematics Department            office: E7A-206
Macquarie University              tel: +61 (0)2 9850 8955
Sydney, Australia 2109            fax: +61 (0)2 9850 8114
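To make the surrogate-pair point concrete: under pdfTeX, without the experimental primitives, tagging a single math letter so that it extracts as a Plane-1 character might look like the following sketch. The choice of U+1D465 (MATHEMATICAL ITALIC SMALL X) is illustrative; its UTF-16BE form with a leading BOM is FEFF D835 DC65, where D835/DC65 is the surrogate pair.

  % Sketch (pdfTeX): make an italic x extract as U+1D465,
  % written as a UTF-16BE hex string with a leading BOM.
  \pdfliteral direct {/Span << /ActualText <FEFFD835DC65> >> BDC}%
  $x$%
  \pdfliteral direct {EMC}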
Re: [XeTeX] turn off special characters in PDF
2014/1/1 Ross Moore :
> Hi Zdenek, and others,
>
> On 01/01/2014, at 11:53, Zdenek Wagner wrote:
>
>>> The attached file (produced using pdfTeX, not XeTeX) is an example that I've used in TUG talks, and elsewhere. Try copy/paste of portions of the mathematics. Be aware that you can get different results depending upon the PDF viewer used when extracting the text. (The file has uncompressed streams, so you can view it in a decent text editor to see the tagging structures used within the PDF content.)
>>
>> If I remember it well, /ActualText supports only bytes, not codepoints. Thus accented characters cannot be encoded, neither Indic characters.
>
> I don't know what you mean by this. In my testing I can tag pretty-much any piece of content, and map it to any string using /ActualText. Mostly I use Adobe's Acrobat Pro as the PDF reader, and this works fine with it, modulo some bugs that have been reported when using very long replacement strings.
>
> In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.

Thank you, now I see it. The book where I read about /ActualText did not mention that I can use UTF-16 if I start the string with a BOM. Can I see the source of the PDF? It could help me much to see how you do all these things.

> I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use. This is certainly possible, and is what I do with mathematical expressions. It should be possible to do it entirely within TeX, but the programming can get very tricky, so I use Perl instead.
>
>> ToUnicode supports one byte to many bytes, not many bytes to many bytes.
>
> Exactly. This is why /ActualText is the structure to use.
>
>> Indic scripts use reordering where a matra precedes the consonants, or some scripts contain two-piece matras. Unless the specification was corrected, the ToUnicode map is unable to handle the Indic scripts properly.
>
> Agreed; /ToUnicode is not what is needed here. This sounds like precisely the kind of situation where you want to tag an extended block of content and use /ActualText to map it to a pre-constructed Unicode string. I'm no expert in Indic languages, so cannot provide specific details or examples.

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] turn off special characters in PDF
On 1/1/14 11:49, Khaled Hosny wrote:

> The situation in XeTeX is more complex because the typesetting (where the original text string is known) is done in XeTeX, while the PDF generation is done by the PDF driver, and the communication channel between both (XDV files) passes only glyph IDs, not the original text strings.

I'd suggest that the best way forward here would be to modify xetex such that it includes the original Unicode text in the xdv stream, as well as the positioned glyphs. Then the driver can write a correct ActualText for each word.

There'd be some performance cost to this, of course; the inclusion of the Unicode text could be an optional feature, so that people who just want a "throwaway" PDF in order to print a document don't have to suffer slower generation and/or larger files.

This wouldn't address all the problems with PDF text extraction; higher-level issues of text structure and flow would still be tricky in the case of documents with any complex layout. But at least the basic Unicode characters making up each word would be reliably correct.

JK
Re: [XeTeX] turn off special characters in PDF
On Wed, Jan 01, 2014 at 10:07:54PM +1100, Ross Moore wrote:

>> ToUnicode supports one byte to many bytes, not many bytes to many bytes.
>
> Exactly. This is why /ActualText is the structure to use.

My only issue with /ActualText is that using it to tag whole words breaks fine text selection (one cannot select individual characters inside these words, and searching for one character will highlight the whole word containing it). Otherwise it is the most versatile mechanism for preserving the original text in PDF files.

Because of that, I think a better strategy is to use the /ToUnicode mapping whenever applicable, and resort to /ActualText for the problematic cases, namely one-to-many substitutions, reordering, and different substitutions leading to the same glyph (though the last one can be handled by duplicating the glyph under a different name/encoding when subsetting the font).

The situation in XeTeX is more complex because the typesetting (where the original text string is known) is done in XeTeX, while the PDF generation is done by the PDF driver, and the communication channel between both (XDV files) passes only glyph IDs, not the original text strings; so we can only rely on font encodings and glyph names (or try to guess glyph names by examining simple font substitutions, as in the upcoming patch).

Regards,
Khaled
Re: [XeTeX] turn off special characters in PDF
Hi Zdenek, and others,

On 01/01/2014, at 11:53, Zdenek Wagner wrote:

>> The attached file (produced using pdfTeX, not XeTeX) is an example that I've used in TUG talks, and elsewhere. Try copy/paste of portions of the mathematics. Be aware that you can get different results depending upon the PDF viewer used when extracting the text. (The file has uncompressed streams, so you can view it in a decent text editor to see the tagging structures used within the PDF content.)
>
> If I remember it well, /ActualText supports only bytes, not codepoints. Thus accented characters cannot be encoded, neither Indic characters.

I don't know what you mean by this. In my testing I can tag pretty-much any piece of content, and map it to any string using /ActualText. Mostly I use Adobe's Acrobat Pro as the PDF reader, and this works fine with it, modulo some bugs that have been reported when using very long replacement strings.

In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs.

I see no reason why Indic character strings could not be done similarly. You probably need some on-the-fly preprocessing to work out the required strings to use. This is certainly possible, and is what I do with mathematical expressions. It should be possible to do it entirely within TeX, but the programming can get very tricky, so I use Perl instead.

> ToUnicode supports one byte to many bytes, not many bytes to many bytes.

Exactly. This is why /ActualText is the structure to use.

> Indic scripts use reordering where a matra precedes the consonants, or some scripts contain two-piece matras. Unless the specification was corrected, the ToUnicode map is unable to handle the Indic scripts properly.

Agreed; /ToUnicode is not what is needed here. This sounds like precisely the kind of situation where you want to tag an extended block of content and use /ActualText to map it to a pre-constructed Unicode string. I'm no expert in Indic languages, so cannot provide specific details or examples.

Happy New Year,

	Ross
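As an illustration of this block-tagging idea (not from the original thread): with the accsupp package, a reordered Devanagari syllable such as कि, where the i-matra is drawn before the consonant but the logical order is KA (U+0915) then VOWEL SIGN I (U+093F), could be tagged along these lines. The syllable is an assumed example; method=hex with the unicode option makes accsupp emit the hex string with a BOM, i.e. <FEFF0915093F>.

  % Sketch: tag a reordered syllable so extraction yields logical order.
  \BeginAccSupp{method=hex,unicode,ActualText=0915093F}%
  कि% glyphs may be visually reordered by the shaper
  \EndAccSupp{}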
Re: [XeTeX] turn off special characters in PDF
2014/1/1 Ross Moore :
> Hi Alex,
>
> On 31/12/2013, at 5:20 PM, Alexey Kryukov wrote:
>
>> On Mon, 30 Dec 2013 10:45:39 +1100 Ross Moore wrote:
>>
>>> I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.
>>
>> Well, the /ActualText approach looks like an overcomplication to me. I think it is intended for very special cases, like treating the 'ck' cluster in the old German hyphenation rules. For typical ligatures it is sufficient to produce a ToUnicode CMap entry mapping the ligature to its source characters. That's what xetex (actually xdvipdfmx) actually does... unless, as Khaled has correctly specified, the font maps its substitution glyphs to PUA or has no glyph names.
>
> Sure. But if you use such fonts for which the CMap is limited in this way, then /ActualText is your best friend.
>
>> And I don't fully understand your remark regarding legacy fonts that may not contain a /ToUnicode resource, since it's up to the PDF generation software (xdvipdfmx in our case) to produce such a resource.
>
> 1. Any time a font character is used in 2 or more different ways, corresponding to different Unicode points, you will face such issues. In legacy (e.g. pre-Unicode) fonts this is not uncommon.
>
> For example, in the original CM fonts, the same font character was used for both the dot-under and dot-above accents, using macros to put the accent within a box and position it either above or below the letter being accented. The CMap file can only specify a single value for this character. What should be the Unicode value? Should it be within the "Combining Character" range?
>
> But it is worse than this: for dot-above, the accent appears within the PDF *before* the letter being accented, while for the dot-under it comes afterwards. Thus combining characters will not work, but can result in the wrong letter being accented. Using an /ActualText is the only reasonable way to cope with this --- apart from switching fonts, of course.
>
> 2. Another example is the ellipsis '…' for which people often just use '...' in the source. One can use /ActualText to map this combination to the correct Unicode character.
>
> 3. Greek capitals, which look the same as Latin letters, are another example.
>
> 4. There are plenty more examples coming from mathematics; especially if you want variable names to copy/paste as Plane-1 alphanumerics.
>
> The attached file (produced using pdfTeX, not XeTeX) is an example that I've used in TUG talks, and elsewhere. Try copy/paste of portions of the mathematics. Be aware that you can get different results depending upon the PDF viewer used when extracting the text. (The file has uncompressed streams, so you can view it in a decent text editor to see the tagging structures used within the PDF content.)

If I remember it well, /ActualText supports only bytes, not codepoints. Thus accented characters cannot be encoded, neither Indic characters. ToUnicode supports one byte to many bytes, not many bytes to many bytes. Indic scripts use reordering where a matra precedes the consonants, or some scripts contain two-piece matras. Unless the specification was corrected, the ToUnicode map is unable to handle the Indic scripts properly.

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
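Example 2 in the quoted list translates directly into accsupp terms. A minimal sketch (assuming the accsupp package) that makes three typed periods extract as a single U+2026 HORIZONTAL ELLIPSIS:

  % Sketch: three source periods, extracted as one U+2026.
  \BeginAccSupp{method=hex,unicode,ActualText=2026}...\EndAccSupp{}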
Re: [XeTeX] turn off special characters in PDF
On Mon, 30 Dec 2013 10:45:39 +1100 Ross Moore wrote:

> I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.

Well, the /ActualText approach looks like an overcomplication to me. I think it is intended for very special cases, like treating the 'ck' cluster in the old German hyphenation rules. For typical ligatures it is sufficient to produce a ToUnicode CMap entry mapping the ligature to its source characters. That's what xetex (actually xdvipdfmx) actually does... unless, as Khaled has correctly specified, the font maps its substitution glyphs to PUA or has no glyph names.

And I don't fully understand your remark regarding legacy fonts that may not contain a /ToUnicode resource, since it's up to the PDF generation software (xdvipdfmx in our case) to produce such a resource.

-- 
Regards,
Alexey Kryukov

Moscow State University
Faculty of History
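For readers unfamiliar with the mechanism described here: a /ToUnicode CMap maps glyph codes in the embedded font to Unicode strings, and a single entry may map one code to several codepoints. A fragment covering ligature glyphs might look like the following, in the same raw-PDF style as the stream excerpts earlier in this thread; the glyph codes 0x01AB and 0x01AC are made-up values for illustration.

  2 beginbfchar
  <01AB> <00660069>      % glyph 0x01AB -> "fi"  (U+0066 U+0069)
  <01AC> <006600660069>  % glyph 0x01AC -> "ffi" (U+0066 U+0066 U+0069)
  endbfchar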
Re: [XeTeX] turn off special characters in PDF
On Mon, Dec 30, 2013 at 06:03:14PM +0100, Zdenek Wagner wrote:

> 2013/12/30 Joe Corneli :
>> Thanks Ross.
>>
>> I think in this case all I really need is to revise the \href code to insert /ActualText (because I'm using small caps for hyperlinks in this doc). Pretty much everything else works fine already.
>
> Small caps have nothing to do with the code points; it is just the shape of the characters. If you enter \textsc{something}, copy&paste should result in lowercase "something".

See my reply; this does not always work, and it is a known limitation of XeTeX/(x)dvipdfmx, though the patch submitted by Alexey Kryukov recently should improve the situation for many fonts that fail today.

Regards,
Khaled
Re: [XeTeX] turn off special characters in PDF
2013/12/30 Joe Corneli :
> Thanks Ross.
>
> I think in this case all I really need is to revise the \href code to insert /ActualText (because I'm using small caps for hyperlinks in this doc). Pretty much everything else works fine already.

Small caps have nothing to do with the code points; it is just the shape of the characters. If you enter \textsc{something}, copy&paste should result in lowercase "something".

> Joe
>
> On Sun, Dec 29, 2013 at 11:45 PM, Ross Moore wrote:
>> Hi Joe,
>>
>> On 30/12/2013, at 8:12 AM, Joe Corneli wrote:
>>
>>> This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357
>>>
>>> Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?
>>
>> In short, no! -- because this is against the idea of making more use of Unicode, across all computing platforms.
>>
>> Certainly a ligature can have an /ActualText replacement consisting of the separate characters, but this requires the PDF producer to have supplied this within the PDF, as it is being generated.
>>
>> I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.
>>
>> Besides, do you really mean *all* special characters? What about simple symbols like ß∑∂√∫Ω and all the other myriad foreign/accented letters and mathematical symbols?
>>
>> If you want these to copy/paste as TeX coding (\beta \Sum \delta \sqrt etc.) within documents that you write yourself, then I wrote a package called mmap where this is an option for the original Computer Modern fonts.
>>
>> Alternatively, a PDF reader might supply a filtering mode that converts the ligatures back to separate characters. Then the user ought to be able to choose whether or not to use this filter. I don't know of any that actually do this. (In any case, you would want such a tool to allow you to specify which characters to replace, and which to preserve.)
>>
>> Your best option is surely to (get someone else to) write such a filter that meets your needs, and use it to post-process the text extracted via Copy/Paste or with other text-extraction tools.
>>
>> Of course this is no use if your aim is to create documents for which others get the desired result via Copy/Paste. For this, the /ActualText approach is what you need.
>>
>> Hope this helps,
>>
>> 	Ross

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] turn off special characters in PDF
Thanks Ross.

I think in this case all I really need is to revise the \href code to insert /ActualText (because I'm using small caps for hyperlinks in this doc). Pretty much everything else works fine already.

Joe

On Sun, Dec 29, 2013 at 11:45 PM, Ross Moore wrote:
> Hi Joe,
>
> On 30/12/2013, at 8:12 AM, Joe Corneli wrote:
>
>> This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357
>>
>> Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?
>
> In short, no! -- because this is against the idea of making more use of Unicode, across all computing platforms.
>
> Certainly a ligature can have an /ActualText replacement consisting of the separate characters, but this requires the PDF producer to have supplied this within the PDF, as it is being generated.
>
> I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.
>
> Besides, do you really mean *all* special characters? What about simple symbols like ß∑∂√∫Ω and all the other myriad foreign/accented letters and mathematical symbols?
>
> If you want these to copy/paste as TeX coding (\beta \Sum \delta \sqrt etc.) within documents that you write yourself, then I wrote a package called mmap where this is an option for the original Computer Modern fonts.
>
> Alternatively, a PDF reader might supply a filtering mode that converts the ligatures back to separate characters. Then the user ought to be able to choose whether or not to use this filter. I don't know of any that actually do this. (In any case, you would want such a tool to allow you to specify which characters to replace, and which to preserve.)
>
> Your best option is surely to (get someone else to) write such a filter that meets your needs, and use it to post-process the text extracted via Copy/Paste or with other text-extraction tools.
>
> Of course this is no use if your aim is to create documents for which others get the desired result via Copy/Paste. For this, the /ActualText approach is what you need.
>
> Hope this helps,
>
> 	Ross
Re: [XeTeX] turn off special characters in PDF
Hi Joe,

On 30/12/2013, at 8:12 AM, Joe Corneli wrote:

> This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357
>
> Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?

In short, no! -- because this is against the idea of making more use of Unicode, across all computing platforms.

Certainly a ligature can have an /ActualText replacement consisting of the separate characters, but this requires the PDF producer to have supplied this within the PDF, as it is being generated.

I've played a lot with this kind of thing, and think that this is the wrong approach. One should use /ActualText to provide the correct Unicode replacement, when one exists. Thus one can extract textual information reliably, even when the PDF uses legacy fonts that may not contain a /ToUnicode resource, or if that resource is inadequate in special situations.

Besides, do you really mean *all* special characters? What about simple symbols like ß∑∂√∫Ω and all the other myriad foreign/accented letters and mathematical symbols?

If you want these to copy/paste as TeX coding (\beta \Sum \delta \sqrt etc.) within documents that you write yourself, then I wrote a package called mmap where this is an option for the original Computer Modern fonts.

Alternatively, a PDF reader might supply a filtering mode that converts the ligatures back to separate characters. Then the user ought to be able to choose whether or not to use this filter. I don't know of any that actually do this. (In any case, you would want such a tool to allow you to specify which characters to replace, and which to preserve.)

Your best option is surely to (get someone else to) write such a filter that meets your needs, and use it to post-process the text extracted via Copy/Paste or with other text-extraction tools.

Of course this is no use if your aim is to create documents for which others get the desired result via Copy/Paste. For this, the /ActualText approach is what you need.

Hope this helps,

	Ross


Ross Moore                        ross.mo...@mq.edu.au
Mathematics Department            office: E7A-206
Macquarie University              tel: +61 (0)2 9850 8955
Sydney, Australia 2109            fax: +61 (0)2 9850 8114
Re: [XeTeX] turn off special characters in PDF
On Sun, Dec 29, 2013 at 09:12:28PM +, Joe Corneli wrote:

> This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357
>
> Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?

It should just work (unless it is a MS font with no glyph names, or a font that uses the PUA for those alternate glyphs; in that case the only solution is to wait for the next TeX Live, where this will, hopefully, be handled).

Regards,
Khaled
[XeTeX] turn off special characters in PDF
This answer talks about how to turn off ligatures: http://tex.stackexchange.com/a/5419/4357

Is there a way to turn off *all* special characters (e.g. small caps) and just get ASCII characters in the copy-and-paste level of the PDF?
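For completeness, the ligature half of this question (the part the linked answer addresses) is straightforward with fontspec; a sketch, with the font name as a placeholder. Small caps are a different matter, taken up in the replies above.

  % Sketch: disable common ligatures for the main font.
  \usepackage{fontspec}
  \setmainfont[Ligatures=NoCommon]{Latin Modern Roman}
  % or switch off the OpenType 'liga' feature directly:
  % \setmainfont[RawFeature={-liga}]{Latin Modern Roman}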