Re: [XeTeX] Whitespace in input

2011-11-19 Thread Keith J. Schultz
Hi Zdenek,

I do not think anybody disputes the fact that characters are not glyphs.

The confusion arises because a character in CS is well defined and has a
history.
To be more exact, it is just one byte in size, so that there can be only
256 characters.

Unicode has changed all this, and we now have a Unicode character whose
size differs depending on the Unicode encoding used.
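
A small sketch of the size point, in Python. It is not part of the original message, and the helper name encoded_sizes is made up for illustration; it only counts how many bytes one character takes in three common Unicode encodings.

```python
def encoded_sizes(ch):
    """Return the byte count of one character in three Unicode encodings."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16-be": len(ch.encode("utf-16-be")),
        "utf-32-be": len(ch.encode("utf-32-be")),
    }

# An ASCII letter, the euro sign, and a Plane-1 math italic letter.
for ch in ("b", "\u20ac", "\U0001d44f"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The same character needs between one and four bytes depending on the encoding, which is the variation described above.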

It gets even hairier, as in Unicode several characters can be
combined (composed).
The result to be output is known as a glyph!

The average user considers a glyph to be the same as a letter and
thereby a character.

Now, in order to process the glyphs with a computer, they must be
decomposed back to Unicode.
How well this is done depends on the system itself. If the system is
not fully Unicode-aware and does not implement it properly, there will
be problems. What adds to the complexity of the problem is that
not all fonts used for displaying Unicode contain all code points,
thereby creating your many-to-many decomposition.

As for getting junk when copying Unicode, just copy between two texts
using different fonts, where one font does
not contain the glyph.

The only true way to master this problem would be for the computer world
to go fully Unicode, with
fonts supporting the full Unicode code set!

That is impractical for the time being.

The only advice I can give is: choose your tools wisely.

regards
Keith.

On 18.11.2011 at 23:51, Zdenek Wagner wrote:

 2011/11/18 maxwell maxw...@umiacs.umd.edu:
 On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner
 zdenek.wag...@gmail.com
 wrote:
 2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:
 Is it safe to assume that these code listings
 are restricted to the ASCII character set ?  If
 so, yes, spaces are likely to be a problem, but
 if the code listing can also include ligature-
 digraphs, then these are likely to prove even
 more problematic.
 
 If the code listing is typeset in a fixed width font, it is usually no
 problem. I copied a few code samples from books in PDF, most of them
 were typeset by TeX. If I want to copy text in Devanagari, it is
 almost impossible.
 
 Besides TeX, Dr. Knuth also invented Literate Programming.  In our own
 project, we use LP to extract the code listings from the original source
 code, rather than from the PDF.  One advantage is that in addition to the
 re-ordering at the character level (mentioned in part of Zdenek's email
 that I didn't copy over), this allows re-ordering at any arbitrary level,
 
 This is a demonstration that glyphs are not the same as characters. I
 will start with a simpler case and will not put Devanagari into the
 mail message. If you wish to write a syllable RU, you have to add a
 dependent vowel (matra) U to a consonant RA. There is a ligature RU,
 so in PDF you will not see RA consonant with U matra but a RU glyph.
 Similarly, TRA is a single glyph representing the following
 characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
 mappings thus it is possible to handle these cases when copying text
 from a PDF or when searching. A more difficult case is the I-matra (short
 dependent vowel I). As a character it must always follow a consonant
 (this is a general rule for all dependent vowels) but visually (as a
 glyph) it precedes the consonant group after which it is pronounced.
 The sample word was kitab (it means a book). In Unicode (as
 characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually
 I-matra precedes KA. XeTeX (knowing that it works with a Devanagari
 script) runs the character sequence through ICU and the result is the
 glyph sequence. The original sequence is lost so that when the text is
 copied from PDF, we get (not exactly) i*katab. Microsoft suggested
 what additional characters should appear in Indic OpenType fonts. One
 of them is a dotted ring which denotes a missing consonant. I-matra
 must always follow a consonant (in character order). If it is moved to
 the beginning of a word, it is wrong. If you paste it to a text
 editor, the OpenType rendering engine should display a missing
 consonant as a dotted ring (if it is present in the font). In
 character order the dotted ring will precede I-matra but in visual
 (glyph) order it will be just opposite. Thus the asterisk shows the
 place where you will see the dotted circle. This is just one simple
 case. I-matra may follow a consonant group, such as in word PRIY
 (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women)
 which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both
 words will start with the I-matra glyph. The latter will contain two
 ordering bugs after copypaste. Consider also word MURTI (statue)
 which is a sequence of characters
 MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
 

Re: [XeTeX] Whitespace in input

2011-11-19 Thread Philip TAYLOR



Keith J. Schultz wrote:


I do not think anybody disputes the fact that characters are not glyphs.

The confusion arises because a character in CS is well defined and has a
history.
To be more exact, it is just one byte in size, so that there can be only
256 characters.


Sorry, Keith, this is patently untrue.  Replace "is" by "was once" and
you get a little closer to the truth, but you still completely ignore
issues such as the difference between (say) EBCDIC and ASCII.  CDC machines
used a 60-bit word, and one character was six bits, not eight.  And before
the advent of the extended character set, a character consisted of seven
bits plus a parity bit, thus yielding at most 128 characters of which
32 were reserved for control functions.


The average user considers a glyph to be the same as a letter and 
thereby a character.


It is rarely safe to believe that one knows what the average user thinks ...


Now, in order to process the glyphs with a computer, they must be
decomposed back to Unicode.


But one rarely, if ever, processes glyphs; the glyphs are the end result,
not the input.  Glyph processing does become necessary in languages such
as Arabic, where context has a major impact on the way in which the
individual glyphs are presented, but in Western languages the nearest we
get to glyph processing is in the formation of ligature digraphs.


How well this is done depends on the system itself. If the system is
not fully Unicode-aware and does not implement it properly, there will
be problems. What adds to the complexity of the problem is that
not all fonts used for displaying Unicode contain all code points,
thereby creating your many-to-many decomposition.

As for getting junk when copying Unicode, just copy between two texts
using different fonts, where one font does
not contain the glyph.

The only true way to master this problem would be for the computer world
to go fully Unicode, with
fonts supporting the full Unicode code set!


I personally hope that this does not happen, and that before then
we have an Omnicode consortium to review the mistakes of Unicode
and to address them in a future, more orthogonal, more consistent,
specification.

Philip Taylor


--
Subscriptions, Archive, and List information, etc.:
 http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] Whitespace in input

2011-11-19 Thread Zdenek Wagner
2011/11/19 Ross Moore ross.mo...@mq.edu.au:
 Hi Zdenek,

 On 19/11/2011, at 10:30 AM, Zdenek Wagner wrote:

 /ActualText is your friend here.
 You tag the content and provide the string that you want to appear
 with Copy/Paste as the value associated to a dictionary key.

 I do not know whether the PDF specification has evolved since I read
 it the last time. /ActualText allows only single-byte characters, ie
 those with codes between 0 and 255, not arbitrary Unicode characters.

 That is most certainly not true.
 You code up UTF-16BE as Hex strings.

 Here is a snippet of the (tagged-pdfLaTeX) source coding from
 the main example that I showed in my  TUG2011 talk.
 The URL for the video of the talk is given in several of my previous emails:

Thank you for the sample. I will try again when I have more time.
Maybe there is a stupid bug in my old code. As a matter of fact, when
playing with /ActualText I knew much less than now.

    \SMC attr{/ActualText<FEFFD835DC4F>\TPDFaloud{1D44F}} noendtext 254 {mi}%
  b%
    _{\noEMC%
   \TPDFsub
    \SMC attr{/ActualText<FEFFD835DC58>\TPDFaloud{1D458}} noendtext 255 {mi}%
  k%
    \EMC
  }^{\EMC
    \SMC attr{/ActualText( )} noendtext 256 {Span}%
  \pdffakespace
    \EMC
  }%
    \TPDFpopbrack
    \SMC attr{/ActualText<FEFF0029>\TPDFaloud{0029}} noendtext 257 {mo}%
  \Bigr)%


 Inside the resulting PDF, this content looks like:

 1 0 0 1 4.902 2.463 cm
 /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )>>
 BDC
 BT
 /F11 9.9626 Tf
  [(b)]TJ
 ET
 EMC
 1 0 0 1 4.276 4.114 cm
 /Span <</MCID 11 /ActualText( )>>
 BDC
 BT
 /F103 1 Tf
  [( )]TJ
 ET
 EMC
 1 0 0 1 0 -6.577 cm
 /mi <</MCID 12 /ActualText<FEFFD835DC58>/Alt(  sub k ,  )>>
 BDC
 BT
 /F10 6.9738 Tf
  [(k)]TJ
 ET
 EMC
 1 0 0 1 4.901 2.463 cm
 /mo <</MCID 13 /Alt(  close bracket:,   , )>>
 BDC


 The full PDF passes all of Adobe's validation tests for
 correct PDF syntax, Accessible Content, PDF/A-1b compliance.

 More particularly:

  /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )>>
  BDC
  BT
  /F11 9.9626 Tf
   [(b)]TJ
  ET
  EMC

 expresses a math-italic 'b' as :

  1.  the glyph in the position of letter 'b' (in CMMI10  font);

  2.  to be spoken aloud as   , b ,   where commas indicate a slight pause

  3.  to Copy/Paste as the surrogate pair  Ux0D835 Ux0DC4F
      equivalent to a Plane-1 math-italic character 'b' .

 The /MCID key is necessary for tagged PDF, but the /Alt and /ActualText
 should work independently to full tagging.
 The '/mi' is immaterial; it could equally well be  '/Span'.
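
 The hex digits that follow /ActualText in the snippets above can be reproduced with a few lines of Python. This is only an illustrative sketch and not part of the macro package discussed; the helper name actualtext_hex is hypothetical. It prepends the byte-order mark FEFF and encodes the text as UTF-16BE, so a Plane-1 character comes out as a surrogate pair.

```python
def actualtext_hex(text):
    """Encode text as UTF-16BE with a leading byte-order mark, as uppercase hex."""
    return (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()

print(actualtext_hex("\U0001d44f"))  # math italic b, U+1D44F
print(actualtext_hex("\U0001d458"))  # math italic k, U+1D458
print(actualtext_hex(")"))           # closing bracket, U+0029
```

 The three results match the strings FEFFD835DC4F, FEFFD835DC58 and FEFF0029 that appear in the source snippet.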


 /ActualText is demonstrated on German hyphenated words such as Zucker
 which is hyphenated as Zuk- ker. I have tried to put /ActualText
 manually via a special, I could see it in the PDF file but it did not
 work.

 Yes, because it is quite important to position the tagging pieces
 correctly within the PDF content stream. It has to balance correctly
 with BT ... ET  and the BDC ... EMC  operator pairs, and there may
 be other subtle requirements.

 Certainly it cannot be done with just a single \special .
 There needs to be stuff both before and after the content
 that causes actual glyphs to be displayed.


 Just using \pdfliteral  is not sufficient with pdfTeX; we needed
 a special modification that allowed the  /mi ...BDC
 and  EMC to fit snugly around the  BT ... ET .

 There could be a similar problem with XeTeX's
     \special{pdf:literal ... }
 (or whatever is the syntax).
 This is the issue that I was trying to discuss with JK in 2009 or 2010.



 When converting a white space to a space character, some [complex]
 heuristics are needed, while proper conversion of glyphs to characters
 of Indic scripts requires just a few strict rules. Ligatures such as TRA
 have to appear in the toUnicode map, otherwise their meaning will be
 unclear. If you see the I-matra, go to the last consonant in the
 sequence and put the I-matra character there. If you see the RA glyph
 at the right edge of a syllable, go back to the leftmost consonant in
 the group and prepend RA+VIRAMA there. This is all that has to be done
 with Devanagari. Other Indic scripts contain two-part vowels but the
 rules will be similarly simple. We should not be forced to double the
 size of the PDF file. AR and other PDF rendering programs should learn
 these simple rules and use them when extracting text.
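
 The I-matra rule above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not what any PDF reader actually does: the function name visual_to_logical is made up, consonants are taken to be the range U+0915 to U+0939, and the RA rule and two-part vowels are left out.

```python
I_MATRA = "\u093f"  # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094d"   # DEVANAGARI SIGN VIRAMA

def is_consonant(ch):
    """Assumption: treat U+0915 (KA) .. U+0939 (HA) as the consonants."""
    return "\u0915" <= ch <= "\u0939"

def visual_to_logical(text):
    """Move each I-matra from before a consonant group to after it."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == I_MATRA and i + 1 < n and is_consonant(text[i + 1]):
            j = i + 2
            # Extend over further VIRAMA + consonant pairs of the same group.
            while j + 1 < n and text[j] == VIRAMA and is_consonant(text[j + 1]):
                j += 2
            out.append(text[i + 1:j])  # the consonant group first
            out.append(I_MATRA)        # then the vowel sign
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# kitab in visual order: I-matra, KA, TA, AA-matra, BA
print(visual_to_logical("\u093f\u0915\u0924\u093e\u092c"))
```

 An I-matra that is not followed by a consonant is left where it is, so a text editor would still show the dotted ring for it.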

 If you can provide the  UTF-16BE Hex representation of these,
 I can create a PDF using it as the /ActualText  replacement for
 some arbitrary string of letters.

 This will test whether this is a viable approach for Devanagari.
 If so, then it is a matter of working out how to expand this
 for a full solution.



 There is a macro package that can do this with pdfTeX, and it is
 a vital part of my Tagged PDF work for mathematics.
 Also, I have an example where the CJK.sty package is extended
 to tag Chinese characters built from multiple glyphs so that
 Copy/Paste works correctly (modulo PDF reader quirks).

 Not sure about XeTeX.

 I once tried to talk with Jonathan Kew 

Re: [XeTeX] Whitespace in input

2011-11-19 Thread Zdenek Wagner
2011/11/19 Keith J. Schultz keithjschu...@web.de:
 Hi Zdenek,

 I do not think anybody disputes the fact that characters are not glyphs.

 The confusion arises because a character in CS is well defined and has a
 history.
 To be more exact, it is just one byte in size, so that there can be only
 256 characters.

 Unicode has changed all this, and we now have a Unicode character whose
 size differs depending on the Unicode encoding used.

 It gets even hairier, as in Unicode several characters can be
 combined (composed).
 The result to be output is known as a glyph!

 The average user considers a glyph to be the same as a letter and
 thereby a character.

 Now, in order to process the glyphs with a computer, they must be
 decomposed back to Unicode.
 How well this is done depends on the system itself. If the system is
 not fully Unicode-aware and does not implement it properly, there will
 be problems. What adds to the complexity of the problem is that
 not all fonts used for displaying Unicode contain all code points,
 thereby creating your many-to-many decomposition.

No, conversion of a sequence of glyphs to a sequence of Unicode
codepoints has little to do with fonts. The position of the RU ligature
in the font may differ, but that is handled easily by the toUnicode map.
The conjunct STA may also occupy different positions in different fonts,
but it can always be printed using two glyphs, half-SA + TA. In general,
the half forms should be decoded as the full form followed by VIRAMA.
This makes the toUnicode table smaller and leads to correct results.
The only problem is the correct ordering of a few characters.
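
The 1:1 and 1:many mappings described here can be pictured as a small lookup table. The Python sketch below is only an illustration: the glyph names and the table TO_UNICODE are hypothetical, and a real toUnicode map is keyed by the font's own glyph codes.

```python
# Hypothetical glyph names mapped to the characters they stand for.
TO_UNICODE = {
    "ra": "\u0930",                        # RA (1:1)
    "ta": "\u0924",                        # TA (1:1)
    "ru_ligature": "\u0930\u0941",         # RA + U-matra (1:many)
    "tra_ligature": "\u0924\u094d\u0930",  # TA + VIRAMA + RA (1:many)
    "half_sa": "\u0938\u094d",             # half form = SA + VIRAMA
}

def glyphs_to_text(glyphs):
    """Concatenate the character sequence recorded for each glyph."""
    return "".join(TO_UNICODE[g] for g in glyphs)

# The conjunct STA printed as two glyphs, half-SA followed by TA.
print(glyphs_to_text(["half_sa", "ta"]))
```

Decoding the half form as SA + VIRAMA means no separate entry is needed for every conjunct that starts with it, which is why the table stays small.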

 As for getting junk when copying Unicode, just copy between two texts
 using different fonts, where one font does
 not contain the glyph.

When performing copy-paste or text search in PDF, I am not interested
in glyphs but in characters. I do not care what glyphs will be
displayed. If I copy the text to OpenOffice, I can change the font
later and if the codepoints were transferred correctly, I will see the
text (it was true even with OpenOffice 1.x, I tried many years ago).
If I copy the text to gedit, fontconfig will automatically find a font
for displaying the characters not present in the current font. I still
have to read the fontconfig manual in order to find how to configure
its searching algorithm. Arabic fonts may be a problem especially if
you wish to use Arabic, Persian and Urdu. Now I know that I have to
force fontconfig to select SIL Scheherazade automatically because it
contains all characters. I can thus use both U+0643 and U+06A. When
writing Akbar, I can write it both in Arabic and in Urdu/Farsi.

 The only true way to master this problem would be for the computer world
 to go fully Unicode, with
 fonts supporting the full Unicode code set!

 That is impractical for the time being.

fontconfig currently has the solution and usually works out of the box.

 The only advice I can give is: choose your tools wisely.

 regards
 Keith.

Re: [XeTeX] Whitespace in input

2011-11-19 Thread Ulrike Fischer
On Sat, 19 Nov 2011 00:30:58 +0100, Zdenek Wagner wrote:

 /ActualText is your friend here.
 You tag the content and provide the string that you want to appear
 with Copy/Paste as the value associated to a dictionary key.

 I do not know whether the PDF specification has evolved since I read
 it the last time. /ActualText allows only single-byte characters, ie
 those with codes between 0 and 255, not arbitrary Unicode characters.

This here works fine with pdflatex + xetex:

\documentclass{article}
\usepackage{accsupp}
\begin{document}
\BeginAccSupp{method=hex,unicode,ActualText=20AC}%
 Euro%
\EndAccSupp{}%

\BeginAccSupp{method=hex,unicode,ActualText=03B1}%
 alpha%
\EndAccSupp{}%
\end{document}

-- 
Ulrike Fischer 





Re: [XeTeX] Whitespace in input

2011-11-19 Thread Zdenek Wagner
2011/11/19 Ulrike Fischer ne...@nililand.de:
 On Sat, 19 Nov 2011 00:30:58 +0100, Zdenek Wagner wrote:

 /ActualText is your friend here.
 You tag the content and provide the string that you want to appear
 with Copy/Paste as the value associated to a dictionary key.

 I do not know whether the PDF specification has evolved since I read
 it the last time. /ActualText allows only single-byte characters, ie
 those with codes between 0 and 255, not arbitrary Unicode characters.

 This here works fine with pdflatex + xetex:

Thank you, the package looks useful.

 \documentclass{article}
 \usepackage{accsupp}
 \begin{document}
 \BeginAccSupp{method=hex,unicode,ActualText=20AC}%
  Euro%
 \EndAccSupp{}%

 \BeginAccSupp{method=hex,unicode,ActualText=03B1}%
  alpha%
 \EndAccSupp{}%
 \end{document}


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-19 Thread Karljurgen Feuerherm



Karljürgen G. Feuerherm, PhD
Undergraduate Advisor
Department of Archaeology and Classical Studies
Wilfrid Laurier University
75 University Avenue West
Waterloo, Ontario N2L 3C5
Tel. (519) 884-1970 x3193
Fax (519) 883-0991 (ATTN Arch.  Classics)




 On Sat, Nov 19, 2011 at  3:39 AM, in message 4ec76b33.2060...@rhul.ac.uk,
Philip TAYLOR p.tay...@rhul.ac.uk wrote: 
 
 I personally hope that this does not happen, and that before then
 we have an Omnicode consortium to review the mistakes of Unicode
 and to address them in a future, more orthogonal, more consistent,
 specification.

Hear, hear! (is that the right spelling?)

Wisdom is of course 20/20 hindsight--and the Omnicodists will make their own 
mistakes... it's inevitable. But still, one should try.

K






Re: [XeTeX] Whitespace in input

2011-11-19 Thread Keith J. Schultz
OUCH! I have been hit by a veteran truck driver's truck. ;-))

I concede! 

I am curious if many still know what a XX-bit word is. Is that term even still 
used?

True, Unicode needs to be cleaned up; it has become too fragmented.

regards
Keith.



Re: [XeTeX] Whitespace in input

2011-11-19 Thread Keith J. Schultz

On 19.11.2011 at 13:51, Zdenek Wagner wrote:

 2011/11/19 Keith J. Schultz keithjschu...@web.de:
 
 As for getting junk when copying Unicode, just copy between two texts
 using different fonts, where one font does
 not contain the glyph.

 When performing copy-paste or text search in PDF, I am not interested
 in glyphs but in characters. I do not care what glyphs will be
 displayed. If I copy the text to OpenOffice, I can change the font
 later and if the codepoints were transferred correctly, I will see the

As you say, if transferred correctly!

 text (it was true even with OpenOffice 1.x, I tried many years ago).
 If I copy the text to gedit, fontconfig will automatically find a font
 for displaying the characters not present in the current font. I still
 have to read the fontconfig manual in order to find how to configure
 its searching algorithm. Arabic fonts may be a problem especially if
 you wish to use Arabic, Persian and Urdu. Now I know that I have to
 force fontconfig to select SIL Scheherazade automatically because it
 contains all characters. I can thus use both U+0643 and U+06A. When
 writing Akbar, I can write it both in Arabic and in Urdu/Farsi

[snip, snip]

 The only advice I can give is: choose your tools wisely.
 

regards
Keith.






Re: [XeTeX] Whitespace in input

2011-11-19 Thread Pander
On 2011-11-19 14:25, Keith J. Schultz wrote:

Perhaps this can be of use:
  https://github.com/wspr/fontspec/issues/121



Re: [XeTeX] Whitespace in input

2011-11-19 Thread Zdenek Wagner
2011/11/19 Pander pan...@users.sourceforge.net:
 On 2011-11-19 14:25, Keith J. Schultz wrote:

 Perhaps this can be of use:
  https://github.com/wspr/fontspec/issues/121

As Khaled wrote, it belongs to the engine. ZWJ and ZWNJ are used in
Indic scripts and they have worked fine since I started to use XeTeX in 2008.


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-19 Thread Chris Travers
On Sat, Nov 19, 2011 at 5:19 AM, Keith J. Schultz keithjschu...@web.de wrote:
 OUCH! I have been hit by a veteran truck drivers truck. ;-))

 I concede!

 I am curious if many still know what a XX-bit word is. Is that term even 
 still used?

It will fade out of use until someone decides we need 128-bit words
and then will pop in again ;-)

Best Wishes,
Chris Travers




Re: [XeTeX] Whitespace in input

2011-11-18 Thread Keith J. Schultz
Hi Philip,

Throughout my programming life I have learned
that internal structure means nothing, as long as the result is correct
when it comes out.

As you rightly point out, the problem lies in how TeX internally
handles space characters when adding them to its internal structure.

The fact is that TeX was not initially designed to handle modern typesetting
well. (Xe)TeX's internals are in part quite outdated. It is possible to
handle all these new types of spaces in (Xe)TeX, yet it is quite awkward and
you have to be a TeXnician to do it properly.

My personal opinion is that TeX et al. have to be revamped completely. Ideally,
it should get a natural language parser as a front end and the typesetting
module as the back end for its output.

Yes, I know this would not be TeX any more and would require a completely
different structure of the TeX ecosystem: language modules and the like. If
you care to discuss this we can go back-channel, as it would be too OT here.

regards
Keith.

On 17.11.2011 at 20:56, Philip TAYLOR wrote:

 Ross, I do not dispute your arguments : I was answering
 Keith's question in an honest way.  I (personally) do not
 think of a space in TeX output as a character at all,
 because I am steeped in TeX philosophy; but I am quite
 willing to accept that /if/ the objective is not to
 produce output for the sake of output, but output for
 subsequent processing as input by another program, then
 there /may/ be an argument for outputting a space as a
 variable-width glyph.
 
 However, I do think that what appears in the output stream
 is a secondary consideration; far more important (IMHO) is
 how we represent that space /within XeTeX/.  There is, I am
 sure, not a suggestion on the table that we start to treat
 a conventional space in XeTeX other than as TeX has traditionally
 treated it, and therefore the real question is (to my mind),
 do we adopt an extension of this traditional TeX treatment
 for non-breaking space, thin-space, and any of the other
 not-quite-standard spaces that Unicode encompasses, or do
 we look for an alternative model which /might/ be glyph-
 or character-based ?.






Re: [XeTeX] Whitespace in input

2011-11-18 Thread Zdenek Wagner
2011/11/18 Keith J. Schultz keithjschu...@web.de:
 Hi Pihilip,

 Thoughout, my programming life and experience I have learned
 that internal structure means nothing, as long as the result is correct
 when it comes out.

 As you rightfully point out the problem lies inside how TeX internally
 handles space characters when adding them to its internal structure.

 The fact is that TeX was not originally designed to handle modern typesetting
 well. (Xe)TeX's internals are partly quite outdated. It is possible to handle
 all these new types of spaces in (Xe)TeX, yet it is quite awkward and you
 have to be a TeXnician to do it properly.

 My personal opinion is that TeX et al. have to be revamped completely.
 Ideally, it should get a natural language parser as a front end and the
 typesetting module as its back end for its output.

I admit that things could be done better than in today's TeX, but a
complete revamp seems to me a bad investment. I would rather think
of an FO processor.

 Yes, I know this would not be TeX any more and would require a completely
 different structure of the TeX ecosystem: language modules and the like.
 If you care to discuss this we can go back channel, as it would be too OT
 here.

 regards
        Keith.





-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-18 Thread Philip TAYLOR



Zdenek Wagner wrote:


I admit that things could be done better than in today's TeX, but a
complete revamp seems to me a bad investment. I would rather think
of an FO processor.


And I agree with Zdeněk : this discussion will be productive only
if we focus on what can be accomplished (w.r.t. spaces) with few
or no changes to XeTeX, not on how we might best deal with the
whole (intellectually daunting) issue of optimally typesetting Unicode.

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-18 Thread Ulrike Fischer
On Fri, 18 Nov 2011 08:31:28 +1100, Ross Moore wrote:

 Yes, that's the point. The goal of TeX is nice typographical
 appearance. The goal of XML is easy data exchange. If I want to send
 structured data, I send XML, not PDF.
 
 These days people want both.

One question which pops up regularly in the TeX groups is how to
insert a code listing in a PDF so that it can be copied and pasted
reliably.

Currently this is not easy, as the heuristics of the readers can
easily lose spaces, and you can't encode tabs or a specific number of
spaces.

Real space characters in the pdf (instead of only visible space)
would help here a lot.
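
To illustrate why gap-based heuristics lose information, here is a minimal Python sketch of how a reader might infer spaces from glyph positions alone. The function name, the (char, x, width) tuple layout and the 0.3 threshold are assumptions made for this illustration; real PDF readers use their own, more elaborate rules.

```python
def extract_text(glyphs, gap_threshold=0.3):
    """Rebuild a line of text from (char, x, width) tuples in page order.

    A space is guessed whenever the horizontal gap to the previous glyph
    exceeds gap_threshold times the current glyph's width.
    """
    out = []
    prev_end = None
    for ch, x, width in glyphs:
        if prev_end is not None and x - prev_end > gap_threshold * width:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)


# "a", two spaces, "b" in a monospaced font: only the gap survives in the PDF.
two_spaces = [("a", 0.0, 1.0), ("b", 3.0, 1.0)]
print(extract_text(two_spaces))   # a b  (the number of spaces is lost)

# Loosely tracked letters: the small gap is not taken for a space.
loose = [("a", 0.0, 1.0), ("b", 1.2, 1.0)]
print(extract_text(loose))        # ab
```

With a real space character stored in the PDF, no such guessing would be needed for the listing.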


-- 
Ulrike Fischer 





Re: [XeTeX] Whitespace in input

2011-11-18 Thread Philip TAYLOR

Is it safe to assume that these code listings
are restricted to the ASCII character set ?  If
so, yes, spaces are likely to be a problem, but
if the code listing can also include ligature-
digraphs, then these are likely to prove even
more problematic.

** Phil.



Re: [XeTeX] Whitespace in input

2011-11-18 Thread Zdenek Wagner
2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:
 Is it safe to assume that these code listings
 are restricted to the ASCII character set ?  If
 so, yes, spaces are likely to be a problem, but
 if the code listing can also include ligature-
 digraphs, then these are likely to prove even
 more problematic.

If the code listing is typeset in a fixed-width font, it is usually no
problem. I copied a few code samples from books in PDF, most of them
were typeset by TeX. If I want to copy text in Devanagari, it is
almost impossible. If I take just a simple Hindi word किताब, the best
result I can get will be िकताब (you should see a dotted circle which is
not visible in the PDF). The reason is that the first two letters are
U+0915, U+093F but visually the latter is displayed first. After
copying you get the reversed order U+093F, U+0915. This is just one of
many problems with Devanagari. The toUnicode map does not help much
with Indian scripts. I have never tried to copy Arabic from PDF, or
even a combination of LTR and RTL within a paragraph.
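
The two orders can be inspected with a short Python sketch. It only uses the standard unicodedata module; the variable and function names are made up for the illustration.

```python
import unicodedata

logical = "\u0915\u093F\u0924\u093E\u092C"  # kitab as typed (character order)
pasted = "\u093F\u0915\u0924\u093E\u092C"   # order obtained after copying from the PDF


def describe(text):
    """Return one 'U+XXXX NAME' string per code point."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for line in describe(logical):
    print(line)

# Same code points, different order: the first two are swapped.
print(sorted(logical) == sorted(pasted))  # True
print(logical == pasted)                  # False
```

The pasted string starts with a dependent vowel sign that has no consonant before it, which is why a rendering engine shows the dotted circle.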





-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-18 Thread maxwell
On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner
zdenek.wag...@gmail.com
wrote:
 2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:
 Is it safe to assume that these code listings
 are restricted to the ASCII character set ?  If
 so, yes, spaces are likely to be a problem, but
 if the code listing can also include ligature-
 digraphs, then these are likely to prove even
 more problematic.

 If the code listing is typeset in a fixed width font, it is usually no
 problem. I copied a few code samples from books in PDF, most of them
 were typeset by TeX. If I want to copy text in Devanagari, it is
 almost impossible. 

Besides TeX, Dr. Knuth also invented Literate Programming.  In our own
project, we use LP to extract the code listings from the original source
code, rather than from the PDF.  One advantage is that in addition to the
re-ordering at the character level (mentioned in part of Zdenek's email
that I didn't copy over), this allows re-ordering at any arbitrary level,
even entire sections of program code.  (We happen to be using XML to
contain the source of both our text and our programming language
constructs, but that's a different issue.)

I agree that it would be nice to be able to reliably copy Unicode text
from the PDF, but (a) that issue isn't confined to program listings, and
(b) that would only solve the character ordering part of the problem.

   Mike Maxwell




Re: [XeTeX] Whitespace in input

2011-11-18 Thread Zdenek Wagner
2011/11/18 maxwell maxw...@umiacs.umd.edu:
 On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner
 zdenek.wag...@gmail.com
 wrote:
 2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:
 Is it safe to assume that these code listings
 are restricted to the ASCII character set ?  If
 so, yes, spaces are likely to be a problem, but
 if the code listing can also include ligature-
 digraphs, then these are likely to prove even
 more problematic.

 If the code listing is typeset in a fixed width font, it is usually no
 problem. I copied a few code samples from books in PDF, most of them
 were typeset by TeX. If I want to copy text in Devanagari, it is
 almost impossible.

 Besides TeX, Dr. Knuth also invented Literate Programming.  In our own
 project, we use LP to extract the code listings from the original source
 code, rather than from the PDF.  One advantage is that in addition to the
 re-ordering at the character level (mentioned in part of Zdenek's email
 that I didn't copy over), this allows re-ordering at any arbitrary level,

This is a demonstration that glyphs are not the same as characters. I
will start with a simpler case and will not put Devanagari into the
mail message. If you wish to write the syllable RU, you have to add a
dependent vowel (matra) U to the consonant RA. There is a ligature RU,
so in the PDF you will not see an RA consonant with a U matra but a
single RU glyph. Similarly, TRA is a single glyph representing the
following characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and
1:many mappings, thus it is possible to handle these cases when copying
text from a PDF or when searching.

A more difficult case is the I-matra (short dependent vowel I). As a
character it must always follow a consonant (this is a general rule for
all dependent vowels) but visually (as a glyph) it precedes the
consonant group after which it is pronounced. The sample word was kitab
(it means "book"). In Unicode (as characters) the order is
KA+I-matra+TA+A-matra(long)+BA. Visually the I-matra precedes KA. XeTeX
(knowing that it works with a Devanagari script) runs the character
sequence through ICU and the result is the glyph sequence. The original
sequence is lost, so that when the text is copied from the PDF, we get
(not exactly) i*katab.

Microsoft suggested what additional characters should appear in Indic
OpenType fonts. One of them is a dotted ring which denotes a missing
consonant. The I-matra must always follow a consonant (in character
order). If it is moved to the beginning of a word, it is wrong. If you
paste it into a text editor, the OpenType rendering engine should
display a missing consonant as a dotted ring (if it is present in the
font). In character order the dotted ring will precede the I-matra but
in visual (glyph) order it will be just the opposite. Thus the asterisk
shows the place where you will see the dotted circle.

This is just one simple case. The I-matra may follow a consonant group,
such as in the word PRIY (dear) which is PA+VIRAMA+RA+I-matra+YA or
STRIYOCIT (good for women) which is
SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both words
will start with the I-matra glyph. The latter will contain two ordering
bugs after copy-paste.

Consider also the word MURTI (statue) which is the sequence of
characters MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long
U-matra will appear as an accent below the MA glyph. The next glyph
will be the I-matra followed by TA followed by RA shown as an upper
accent at the right edge of the syllable. Generally in
RA+VIRAMA+consonant+matra the RA glyph appears at the end of the
syllable although logically (in character order) it belongs to the
beginning. These cases cannot be solved by the toUnicode map because
many-to-many mappings are not allowed. Moreover, a huge number of
mappings would be needed. It would be better to do the reverse
processing independently of the toUnicode mappings, using ICU or Pango
or Uniscribe or whatever to analyze the glyphs and convert them to
characters. The rules are unambiguous but AR (Adobe Reader) does not
do it.
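
The limits of a per-glyph map can be shown with a small Python sketch. The glyph names and the dictionary are hypothetical stand-ins for a font's glyph IDs and its toUnicode entries; the point is only that a 1:1 or 1:many lookup recovers ligatures but cannot change the order of glyphs.

```python
# Hypothetical glyph names; a real font uses its own glyph IDs.
TO_UNICODE = {
    "ka": "\u0915",
    "ta": "\u0924",
    "ba": "\u092C",
    "i_matra": "\u093F",
    "aa_matra": "\u093E",
    "ru": "\u0930\u0941",         # ligature RU  -> RA + U-matra (1:many)
    "tra": "\u0924\u094D\u0930",  # ligature TRA -> TA + VIRAMA + RA (1:many)
}


def extract(glyphs):
    """Map each glyph to its characters, keeping the glyph (visual) order."""
    return "".join(TO_UNICODE[g] for g in glyphs)


kitab_glyphs = ["i_matra", "ka", "ta", "aa_matra", "ba"]  # visual order
kitab_logical = "\u0915\u093F\u0924\u093E\u092C"          # character order

print(extract(["tra"]) == "\u0924\u094D\u0930")  # True: the ligature is recovered
print(extract(kitab_glyphs) == kitab_logical)    # False: the I-matra stays in front
```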

We are discussing non-breaking spaces while we are not yet able to
properly convert printable glyphs to characters when copy-pasting from
PDF...


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-18 Thread Ross Moore
Hi Zdenek,

On 19/11/2011, at 9:51 AM, Zdenek Wagner wrote:

 This is a demonstration that glyphs are not the same as characters. I
 will start with a simpler case and will not put Devanagari into the
 mail message. If you wish to write a syllable RU, you have to add a
 dependent vowel (matra) U to a consonant RA. There is a ligature RU,
 so in PDF you will not see RA consonant with U matra but a RU glyph.
 Similarly, TRA is a single glyph representing the following
 characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
 mappings thus it is possible to handle these cases when copying text
 from a PDF or when searching. More difficult case is I matra (short
 dependent vowel I). As a character it must always follow a consonant
 (this is a general rule for all dependent vowels) but visually (as a
 glyph) it precedes the consonant group after which it is pronounced.
 The sample word was kitab (it means a book). In Unicode (as
 characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually
 I-matra precedes KA. XeTeX (knowing that it works with a Devanagari
 script) runs the character sequence through ICU and the result is the
 glyph sequence. The original sequence is lost so that when the text is
 copied from PDF, we get (not exactly) i*katab.

/ActualText is your friend here.
You tag the content and provide the string that you want to appear
with Copy/Paste as the value associated to a dictionary key.

There is a macro package that can do this with pdfTeX, and it is 
a vital part of my Tagged PDF work for mathematics.
Also, I have an example where the CJK.sty package is extended
to tag Chinese characters built from multiple glyphs so that
Copy/Paste works correctly (modulo PDF reader quirks).

Not sure about XeTeX.

I once tried to talk with Jonathan Kew about what would be needed 
to implement this properly, but he got totally the wrong idea 
concerning glyphs and characters, and what was needed to be done
internally and what by macros. The conversation went nowhere.

 Microsoft suggested
 what additional characters should appear in Indic OpenType fonts. One
 of them is a dotted ring which denotes a missing consonant. I-matra
 must always follow a consonant (in character order). If it is moved to
 the beginning of a word, it is wrong. If you paste it to a text
 editor, the OpenType rendering engine should display a missing
 consonant as a dotted ring (if it is present in the font). In
 character order the dotted ring will precede I-matra but in visual
 (glyph) order it will be just opposite. Thus the asterisk shows the
 place where you will see the dotted circle. This is just one simple
 case. I-matra may follow a consonant group, such as in word PRIY
 (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women)
 which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both
 words will start with the I-matra glyph. The latter will contain two
 ordering bugs after copypaste. Consider also word MURTI (statue)
 which is a sequence of characters

This sounds like each word needs its own /ActualText .
So some intricate programming is certainly necessary.
But \XeTeXinterchartoks  (is that the right spelling?)
should make this possible.

 MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
 appear as an accent below the MA glyph. The next glyph will be I-matra
 followed by TA followed by RA shown as an upper accent at the right
 edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA
 glyph appears at the end of the syllable although logically (in
 character order) it belongs to the beginning. These cases cannot be
 solved by toUnicode map because many-to-many mappings are not allowed.

Agreed.  /ToUnicode  is not the right PDF construction for this.

 Moreover, a huge amount of mappings will be needed. It would be better
 to do the reverse processing independent of toUnicode mappings, to use
 ICU or Pango or Uniscribe or whatever to analyze the glyphs and
 convert them to characters. The rules are unambiguous but AR does not
 do it.

Having an external pre-processor is what I do for tagging mathematics.
It seems like a similarly intricate problem here.

 
 We discuss nonbreakable spaces while we are not yet able to convert
 properly printable glyphs to characters when doing copypaste from
 PDF...

  :-)

 
 
 -- 
 Zdeněk Wagner
 http://hroch486.icpf.cas.cz/wagner/
 http://icebearsoft.euweb.cz

Hope this helps,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-419  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114








Re: [XeTeX] Whitespace in input

2011-11-18 Thread Zdenek Wagner
2011/11/19 Ross Moore ross.mo...@mq.edu.au:
 Hi Zdenek,

 /ActualText is your friend here.
 You tag the content and provide the string that you want to appear
 with Copy/Paste as the value associated to a dictionary key.

I do not know whether the PDF specification has evolved since I last
read it. /ActualText allows only single-byte characters, i.e.
those with codes between 0 and 255, not arbitrary Unicode characters.
/ActualText is demonstrated on German hyphenated words such as Zucker
which is hyphenated as Zuk- ker. I have tried to put /ActualText
in manually via a special; I could see it in the PDF file but it did not
work.

When converting white space to a space character, some [complex]
heuristics are needed, while proper conversion of glyphs to characters
of Indic scripts requires just a few strict rules. Ligatures such as TRA
have to appear in the toUnicode map, otherwise their meaning will be
unclear. If you see the I-matra, go to the last consonant in the
sequence and put the I-matra character there. If you see the RA glyph
at the right edge of a syllable, go back to the leftmost consonant in
the group and prepend RA+VIRAMA there. This is all that has to be done
with Devanagari. Other Indic scripts contain two-part vowels but the
rules will be similarly simple. We should not be forced to double the
size of the PDF file. AR and other PDF rendering programs should learn
these simple rules and use them when extracting text.
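
The two rules can be sketched in Python as follows. This is only an illustration under simplifying assumptions: ligatures are taken to be already expanded via the toUnicode map, the input is in visual order, and REPH is a hypothetical private-use marker standing for the RA glyph drawn at the right edge of a syllable (a real extractor would have to recognise that glyph in the font).

```python
VIRAMA = "\u094D"
I_MATRA = "\u093F"
RA = "\u0930"
REPH = "\uE000"  # hypothetical marker for the RA glyph at the right edge


def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"  # KA .. HA


def is_matra(ch):
    return "\u093E" <= ch <= "\u094C"  # dependent vowel signs AA .. AU


def cluster_end(chars, start):
    """Index just past the consonant group that begins at chars[start]."""
    i = start + 1
    while i + 1 < len(chars) and chars[i] == VIRAMA and is_consonant(chars[i + 1]):
        i += 2
    return i


def visual_to_logical(visual):
    chars = list(visual)

    # Rule 2: replace the RA mark by RA+VIRAMA placed before the
    # leftmost consonant of its group.
    while REPH in chars:
        i = chars.index(REPH)
        j = i - 1
        while j > 0 and is_matra(chars[j]):
            j -= 1
        while j >= 2 and chars[j - 1] == VIRAMA and is_consonant(chars[j - 2]):
            j -= 2
        del chars[i]
        j = max(j, 0)
        chars[j:j] = [RA, VIRAMA]

    # Rule 1: move each I-matra behind the consonant group it visually precedes.
    i = 0
    while i < len(chars):
        if chars[i] == I_MATRA and i + 1 < len(chars) and is_consonant(chars[i + 1]):
            end = cluster_end(chars, i + 1)
            chars.insert(end, I_MATRA)
            del chars[i]
            i = end
        else:
            i += 1
    return "".join(chars)


kitab_visual = "\u093F\u0915\u0924\u093E\u092C"  # I-matra, KA, TA, AA-matra, BA
murti_visual = "\u092E\u0942\u093F\u0924\uE000"  # MA, UU-matra, I-matra, TA, RA mark

print(visual_to_logical(kitab_visual) == "\u0915\u093F\u0924\u093E\u092C")        # True
print(visual_to_logical(murti_visual) == "\u092E\u0942\u0930\u094D\u0924\u093F")  # True
```

The function must only be fed text in visual order; applied to text that is already in character order it would move the I-matra again.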


Re: [XeTeX] Whitespace in input

2011-11-18 Thread Ross Moore
Hi Zdenek,

On 19/11/2011, at 10:30 AM, Zdenek Wagner wrote:

 /ActualText is your friend here.
 You tag the content and provide the string that you want to appear
 with Copy/Paste as the value associated to a dictionary key.
 
 I do not know whether the PDF specification has evolved since I read
 it the last time. /ActualText allows only single-byte characters, ie
 those with codes between 0 and 255, not arbitrary Unicode characters.

That is most certainly not true.
You code up UTF-16BE as Hex strings.
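
As a minimal sketch of that encoding step in Python (the function name is made up for the illustration): the text is encoded as UTF-16BE, prefixed with the byte order mark FEFF and written as hex digits, which then go between the angle brackets of a PDF hex string.

```python
def actual_text_hex(text):
    """Return text as BOM-prefixed UTF-16BE hex digits (without the <> delimiters)."""
    return "FEFF" + text.encode("utf-16-be").hex().upper()


# Plane-1 characters become surrogate pairs.
print(actual_text_hex("\U0001D44F"))  # FEFFD835DC4F

# BMP characters such as Devanagari take four hex digits each.
print(actual_text_hex("\u0915\u093F\u0924\u093E\u092C"))  # FEFF0915093F0924093E092C
```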

Here is a snippet of the (tagged-pdfLaTeX) source coding from 
the main example that I showed in my  TUG2011 talk. 
The URL for the video of the talk is given in several of my previous emails:

\SMC attr{/ActualText<FEFFD835DC4F>\TPDFaloud{1D44F}} noendtext 254 {mi}%
  b%
_{\noEMC%
   \TPDFsub 
\SMC attr{/ActualText<FEFFD835DC58>\TPDFaloud{1D458}} noendtext 255 {mi}%
  k%
\EMC 
  }^{\EMC 
\SMC attr{/ActualText( )} noendtext 256 {Span}%
  \pdffakespace
\EMC 
  }%
\TPDFpopbrack 
\SMC attr{/ActualText<FEFF0029>\TPDFaloud{0029}} noendtext 257 {mo}%
  \Bigr)%


Inside the resulting PDF, this content looks like:

 1 0 0 1 4.902 2.463 cm
 /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )>>
 BDC
 BT
 /F11 9.9626 Tf
  [(b)]TJ
 ET
 EMC
 1 0 0 1 4.276 4.114 cm
 /Span <</MCID 11 /ActualText( )>>
 BDC
 BT
 /F103 1 Tf
  [( )]TJ
 ET
 EMC
 1 0 0 1 0 -6.577 cm
 /mi <</MCID 12 /ActualText<FEFFD835DC58>/Alt(  sub k ,  )>>
 BDC
 BT
 /F10 6.9738 Tf
  [(k)]TJ
 ET
 EMC
 1 0 0 1 4.901 2.463 cm
 /mo <</MCID 13 /Alt(  close bracket:,   , )>>
 BDC


The full PDF passes all of Adobe's validation tests for
correct PDF syntax, Accessible Content, PDF/A-1b compliance.

More particularly:
 
  /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )>>
  BDC
  BT
  /F11 9.9626 Tf
   [(b)]TJ
  ET
  EMC

expresses a math-italic 'b' as :

 1.  the glyph in the position of letter 'b' (in CMMI10  font);

 2.  to be spoken aloud as   , b ,   where commas indicate a slight pause

 3.  to Copy/Paste as the surrogate pair U+D835 U+DC4F,
  equivalent to the Plane-1 math-italic character 'b' (U+1D44F).

The /MCID key is necessary for tagged PDF, but the /Alt and /ActualText
should work independently of full tagging.
The '/mi' is immaterial; it could equally well be '/Span'.
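
To check what such a hex string stands for, it can be decoded again. The following Python sketch (function name made up for the illustration) strips the FEFF byte order mark and decodes the rest as UTF-16BE, so the surrogate pair collapses into a single Plane-1 character.

```python
import unicodedata


def decode_actual_text(hex_digits):
    """Decode BOM-prefixed UTF-16BE hex digits, as used in an /ActualText value."""
    raw = bytes.fromhex(hex_digits)
    if raw[:2] != b"\xfe\xff":
        raise ValueError("expected a leading FEFF byte order mark")
    return raw[2:].decode("utf-16-be")


ch = decode_actual_text("FEFFD835DC4F")
print(f"U+{ord(ch):04X}")    # U+1D44F
print(unicodedata.name(ch))  # MATHEMATICAL ITALIC SMALL B
```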


 /ActualText is demonstrated on German hyphenated words such as Zucker
 which is hyphenated as Zuk- ker. I have tried to put /ActualText
 manually via a special, I could see it in the PDF file but it did not
 work.

Yes, because it is quite important to position the tagging pieces
correctly within the PDF content stream. It has to balance correctly
with BT ... ET  and the BDC ... EMC  operator pairs, and there may
be other subtle requirements.

Certainly it cannot be done with just a single \special .
There needs to be stuff both before and after the content
that causes actual glyphs to be displayed.
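
The required shape can be sketched by generating the content-stream fragment in Python. This is an illustration of the bracketing only, not of how any TeX engine emits it; the function name, the font resource /F1 and the shown string are placeholders.

```python
def marked_span(actual_text, text_ops):
    """Wrap text-showing operators in a /Span marked-content sequence.

    The BDC ... EMC pair encloses the whole BT ... ET text object, as in the
    fragment shown earlier in this message.
    """
    hex_value = "FEFF" + actual_text.encode("utf-16-be").hex().upper()
    lines = [f"/Span <</ActualText <{hex_value}>>> BDC", "BT"]
    lines.extend(text_ops)
    lines.extend(["ET", "EMC"])
    return "\n".join(lines)


# Placeholder font and glyph string; the replacement text is the word kitab.
print(marked_span("\u0915\u093F\u0924\u093E\u092C", ["/F1 10 Tf", "[(abc)] TJ"]))
```

A single \special can only contribute one side of this bracket, which is why material is needed both before and after the glyphs.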


Just using \pdfliteral is not sufficient with pdfTeX; we needed
a special modification that allowed the  /mi ... BDC
and  EMC  to fit snugly around the  BT ... ET .

There could be a similar problem with XeTeX's 
 \special{pdf:literal ... }  
(or whatever is the syntax).
This is the issue that I was trying to discuss with JK in 2009 or 2010.


 
 When converting a white space to a space character some [complex]
 heuristics is needed while proper conversion of glyphs to characters
 of Indic scripts require just a few strict rules. The ligatures as TRA
 have to appear in the toUnicode map, otherwise its meaning will be
 unclear. If you see the I-matra, go to the last consonant in the
 sequence and put the I-matra character there. If you see the RA glyph
 at the right edge of a syllable, go back to the leftmost consonant in
 the group and prepend RA+VIRAMA there. This is all what has to be done
 with Devanagari. Other Indic scripts contain two-part vowels but the
 rules will be similarly simple. We should not be forced to double the
 size of the PDF file. AR and other PDF rendering programs should learn
 these simple rules and use them when extracting text.

If you can provide the  UTF-16BE Hex representation of these,
I can create a PDF using it as the /ActualText  replacement for 
some arbitrary string of letters.

This will test whether this is a viable approach for Devanagari.
If so, then it is a matter of working out how to expand this
for a full solution.


 


Cheers,


Re: [XeTeX] Whitespace in input

2011-11-17 Thread Keith J. Schultz
O.K.

You mention in a later post that you do consider a space as a printable
character.
I do disagree, in the sense that even though you cannot actually see how
many spaces are in a run, a space does have a size and thereby a fixed
visual effect.

I do agree with you that a space character should, in good layout, be
changed to white space to accommodate good line breaking. So it is not
truly a printable character in text layout.

Though I do consider inter-character spacing a preferable method to
achieve a more aesthetic look.

Now more to the point.

Often enough there are conventions that one has to follow concerning the
wrapping of words, most prominently for names.

As an example I will use my name, Keith J. Schultz. (Yes, this is not the
best example, and (Xe)TeX has ways of getting around this.) Names should
not be wrapped, nor should there be unnecessary space between the parts.
Generally, it is O.K. to wrap/line break after the J., but not between
Keith and J., so I need a non-breaking space between them. Also, you do
not want different spaces between Keith, J. and Schultz, yet not the same
space as used between other words of the line. If the J. bothers you, use
Johan instead. The same is true of Mrs. Smith.

So the use of a non-breaking space with a given size is advisable for
input. Of course, what TeX et al. should output is debatable, and it
wreaks havoc with TeX's line-breaking algorithm.

It is often hard to get the desired results. But, the way TeX works, this
will always be a problem.
Yet when I enter a non-breaking space, that is what I want, and more often
than not a space of fixed size across the board.

regards
Keith.




On 15.11.2011 at 12:09, Philip and Le Khanh wrote:

 
 
 Keith J. Schultz wrote:
 
 A non-breaking space is to me a printable character, in so far that
 it is important and must be used to distinguish between word space, et al.
 
 If, for you, [a] non-breaking space is a printable character, then
 presumably that character must be taken from some font.  If you take
 a character from a font, it will have a size, and although it can be
 combined with kerning rules to adjust its position w.r.t. adjacent
 characters,  the logic for this is fairly restricted.  In particular,
 it cannot take into account the amount by which TeX is seeking to
 expand or contract spaces on the current line in order to achieve
 optimal paragraphs.  So in your model of the ideal universe, non-breaking 
 Unicode spaces would not behave as do conventional
 TeX non-breaking spaces (which /do/ expand and contract to assist
 in TeX's line-breaking), nor would they conform to their Unicode
 definition where their decomposition is defined as :
 
   <noBreak> SPACE (U+0020)
 
 I wonder if you would like to discuss these points ?
 
 Philip Taylor
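
The decomposition mentioned above can be looked up in the Unicode character database, for instance with Python's standard unicodedata module. This is just a quick check, not part of any TeX workflow.

```python
import unicodedata

nbsp = "\u00A0"

print(unicodedata.name(nbsp))           # NO-BREAK SPACE
print(unicodedata.decomposition(nbsp))  # <noBreak> 0020

# Compatibility normalisation drops the no-break property and
# leaves an ordinary space.
print(unicodedata.normalize("NFKC", "Mrs.\u00A0Smith") == "Mrs. Smith")  # True
```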






Re: [XeTeX] Whitespace in input

2011-11-17 Thread Keith J. Schultz

On 17.11.2011 at 11:26, Keith J. Schultz wrote:

 O.K.
 
 You mention in a later post that you do consider a space as a printable 
 character.
This line should read as:
 You mention in a later post that you consider a space as a 
non-printable character.







Re: [XeTeX] Whitespace in input

2011-11-17 Thread Philip TAYLOR



Keith J. Schultz wrote:


On 17.11.2011 at 11:26, Keith J. Schultz wrote:


O.K.

You mention in a later post that you do consider a space as a printable 
character.

This line should read as:
  You mention in a later post that you consider a space as a 
non-printable character.


No, I don't think of it as a character at all, when we are talking
about typeset output (as opposed to ASCII (or Unicode) input).  Clearly
it is a character on input, but unless it generates a glyph in the
output stream (which TeX does not, for normal spaces) then it is not
a character (/qua/ character) on output but rather a formatting
instruction not dissimilar to (say) end-of-line.

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-17 Thread Ross Moore
Hi Phil,

On 17/11/2011, at 23:53, Philip TAYLOR p.tay...@rhul.ac.uk wrote:

 Keith J. Schultz wrote:
 
 You mention in a later post that you do consider a space as a printable 
 character.
This line should read as:
  You mention in a later post that you consider a space as a 
 non-printable character.
 
 No, I don't think of it as a character at all, when we are talking
 about typeset output (as opposed to ASCII (or Unicode) input).  

This is fine, when all that you require of your output is that it be visible on
a printed page. But modern communication media go well beyond that.
A machine needs to be able to tell where words and lines end, to reflow
paragraphs when appropriate, and to produce a flat extraction of all the
text, perhaps also with some indication of the purpose of that text (e.g. by
structural tagging).

In short, what is output for one format should also be able to serve as input 
for another.

Thus the space certainly does play the role of an output character – though the 
presence of a gap in the positioning of visible letters may serve this role in 
many, but not all, circumstances.

 Clearly
 it is a character on input, but unless it generates a glyph in the
 output stream (which TeX does not, for normal spaces) then it is not
 a character (/qua/ character) on output but rather a formatting
 instruction not dissimilar to (say) end-of-line.

But a formatting instruction for one program cannot serve as reliable input for 
another.
A heuristic is then needed, to attempt to infer that a programming instruction 
must have been used, and guess what kind of instruction it might have been. 
This is not 100% reliable, so is deprecated in modern methods of data storage 
and document formats.
XML based formats use tagging, rather than programming instructions. This is
the modern way, which is used extensively for communicating data between 
different software systems.
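
To make the point about heuristics concrete, here is a minimal sketch in Python of how a text extractor might guess word boundaries from glyph positions alone. The `Glyph` record, the `gap_ratio` threshold and the sample coordinates are all assumptions made for illustration; no real PDF library or file format is being described.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # the visible character
    x: float      # left edge on the line (hypothetical units)
    width: float  # advance width, in the same units


def extract_text(glyphs, gap_ratio=0.25):
    """Rebuild a flat string from positioned glyphs.

    A space is inserted wherever the gap to the previous glyph exceeds
    gap_ratio times that glyph's width. The threshold is a guess, which
    is why this kind of extraction is not fully reliable.
    """
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Four glyphs: a real word gap before "c", and a small kern before "d".
line = [
    Glyph("a", 0.0, 5.0),
    Glyph("b", 5.0, 5.0),
    Glyph("c", 13.0, 5.0),
    Glyph("d", 18.5, 5.0),
]

print(extract_text(line))
print(extract_text(line, gap_ratio=0.05))
```

The two calls disagree about whether the small kern before "d" is a word boundary, so the result depends on a tuning parameter that the producing program never recorded. An explicit space character or a structural tag removes that guesswork.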

 
 ** Phil.

TeX's strength is in its superior ability to position characters on the page 
for maximum visual effect. This is done by producing detailed programming 
instructions within the content stream of the PDF output. However, this is not 
enough to meet the needs of formats such as EPUB, non-visual reading software, 
archival formats, searchability, and other needs.
Tagged PDF can be viewed as Adobe's response to address these requirements as 
an extension of the visual aspects of the PDF format. It is a direction in 
which TeX can (and surely must) move, to stay relevant within the publishing 
industry of the future.


Hope this helps,

 Ross



Re: [XeTeX] Whitespace in input

2011-11-17 Thread Philip TAYLOR

Ross, I do not dispute your arguments : I was answering
Keith's question in an honest way.  I (personally) do not
think of a space in TeX output as a character at all,
because I am steeped in TeX philosophy; but I am quite
willing to accept that /if/ the objective is not to
produce output for the sake of output, but output for
subsequent processing as input by another program, then
there /may/ be an argument for outputting a space as a
variable-width glyph.

However, I do think that what appears in the output stream
is a secondary consideration; far more important (IMHO) is
how we represent that space /within XeTeX/.  There is, I am
sure, not a suggestion on the table that we start to treat
a conventional space in XeTeX other than as TeX has traditionally
treated it, and therefore the real question is (to my mind),
do we adopt an extension of this traditional TeX treatment
for non-breaking space, thin-space, and any of the other
not-quite-standard spaces that Unicode encompasses, or do
we look for an alternative model which /might/ be glyph-
or character-based ?.

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-17 Thread Zdenek Wagner
2011/11/17 Ross Moore ross.mo...@mq.edu.au:
 Hi Phil,
 On 17/11/2011, at 23:53, Philip TAYLOR p.tay...@rhul.ac.uk wrote:

 Keith J. Schultz wrote:

 You mention in a later post that you do consider a space as a printable
 character.

This line should read as:

  You mention in a later post that you consider a space as a
 non-printable character.

 No, I don't think of it as a character at all, when we are talking
 about typeset output (as opposed to ASCII (or Unicode) input).

 This is fine, when all that you require of your output is that it be visible
 on
 a printed page. But modern communication media goes much beyond that.
 A machine needs to be able to tell where words and lines end, reflowing
 paragraphs when appropriate and able to produce a flat extraction of all the
 text, perhaps also with some indication of the purpose of that text (e.g. by
 structural tagging).
 In short, what is output for one format should also be able to serve as
 input for another.
 Thus the space certainly does play the role of an output character - though
 the presence of a gap in the positioning of visible letters may serve this
 role in many, but not all, circumstances.

 Clearly
 it is a character on input, but unless it generates a glyph in the
 output stream (which TeX does not, for normal spaces) then it is not
 a character (/qua/ character) on output but rather a formatting
 instruction not dissimilar to (say) end-of-line.

 But a formatting instruction for one program cannot serve as reliable input
 for another.
 A heuristic is then needed, to attempt to infer that a programming
 instruction must have been used, and guess what kind of instruction it might
 have been. This is not 100% reliable, so is deprecated in modern methods of
 data storage and document formats.
 XML based formats use tagging, rather than programming instructions. This is
 the modern way, which is used extensively for communicating data between
 different software systems.

Yes, that's the point. The goal of TeX is nice typographical
appearance. The goal of XML is easy data exchange. If I want to send
structured data, I send XML, not PDF.

 ** Phil.

 TeX's strength is in its superior ability to position characters on the page
 for maximum visual effect. This is done by producing detailed programming
 instructions within the content stream of the PDF output. However, this is
 not enough to meet the needs of formats such as EPUB, non-visual reading
 software, archival formats, searchability, and other needs.
 Tagged PDF can be viewed as Adobe's response to address these requirements
 as an extension of the visual aspects of the PDF format. It is a direction
 in which TeX can (and surely must) move, to stay relevant within the
 publishing industry of the future.

 Hope this helps,
  Ross

No, it does not help. Remember that the last (almost) portable version
of PDF is 1.2. If you try to open a tagged PDF, or even a PDF with a
ToUnicode map or a colorspace other than RGB or CMYK, in Acrobat Reader
3, it displays a fatal error and dies. I reported it to Adobe in March
2001 and they did nothing. I even reported another fatal bug in
January 2001. I sent sample files but nothing happened; Adobe just
stopped development of Acrobat Reader at buggy version 3 for some
operating systems. Why do you rely so much on Adobe? When exchanging
structured documents I will always do it in XML and never create
tagged PDF, because I know that some users will be unable to read it
with Adobe Acrobat Reader. I do not wish to make them dependent on
ghostscript and similar tools.






-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-17 Thread Ross Moore
Hello Zdenek,

On 18/11/2011, at 7:49 AM, Zdenek Wagner wrote:

 But a formatting instruction for one program cannot serve as reliable input
 for another.
 A heuristic is then needed, to attempt to infer that a programming
 instruction must have been used, and guess what kind of instruction it might
 have been. This is not 100% reliable, so is deprecated in modern methods of
 data storage and document formats.
 XML based formats use tagging, rather than programming instructions. This is
 the modern way, which is used extensively for communicating data between
 different software systems.
 
 Yes, that's the point. The goal of TeX is nice typographical
 appearance. The goal of XML is easy data exchange. If I want to send
 structured data, I send XML, not PDF.

These days people want both.

 
 ** Phil.
 
 TeX's strength is in its superior ability to position characters on the page
 for maximum visual effect. This is done by producing detailed programming
 instructions within the content stream of the PDF output. However, this is
 not enough to meet the needs of formats such as EPUB, non-visual reading
 software, archival formats, searchability, and other needs.
 Tagged PDF can be viewed as Adobe's response to address these requirements
 as an extension of the visual aspects of the PDF format. It is a direction
 in which TeX can (and surely must) move, to stay relevant within the
 publishing industry of the future.
 
 Hope this helps,
 Ross
 
 No, it does not help. Remember that the last (almost) portable version
 of PDF is 1.2. If you are to open tagged PDF or even PDF with a
 toUnicode map or a colorspace other than RGB or CMYK in Acrobat Reader
 3, it displays a fatal error and dies. I reported it to Adobe in March
 2001 and they did nothing.

What else would you expect?
AR is at version 10 now.
On Linux it is at version 9 now, indeed 9.4.6 is current.

You don't expect TeX formats prior to TeX3 to handle non-ASCII
characters, so why would you expect other people's older software 
versions to handle documents written for later formats?

 I even reported another fatal bug in
 January 2001. I sent sample files but nothing happened, Adobe just
 stopped development of Acrobat Reader at buggy version 3 for some
 operating systems.

Why should they support OSs that have a limited life-time?
Industry moves on. A new computer is very cheap these days,
with software that can do things your older one never could do.

By all means keep the old one while it still does useful work, 
but you get another to do things that the older cannot handle.

 Why do you so much rely on Adobe? When exchanging
 structured documents I will always do it in XML and never create
 tagged PDF because ...

PDF, as a published standard, is not maintained by Adobe itself 
these days, yet Adobe continues to provide a free reader, at least 
for the visual aspects. That makes documents in PDF viewable by 
everyone (who is only interested in the visual aspect).

It is an ISO standard, which publishers will want to use.
Most of the people who use (La)TeX are academics or others
who need to do a fair amount of publishing, of one kind
or another.

TeX can be modified to become capable of producing Tagged PDF.
 (See the videos of my talks.)
Free software (Poppler) is being developed to handle most aspects
of PDF content, though it hasn't yet progressed enough to support
structure tagging. It's surely on the list of things to do.

  ... I know that some users will be unable to read them
 by Adobe Acrobat Reader.

Why not?
It is not Adobe Reader that is holding them back.

 I do not wish to make them dependent on
 ghostscript and similar tools.

You'll have to give some more details of who you are
referring to here, and why their economic circumstances
require them to have access to XML-transmitted data,
but preclude them from access to other kinds of standard 
computing software and devices.


 -- 
 Zdeněk Wagner
 http://hroch486.icpf.cas.cz/wagner/
 http://icebearsoft.euweb.cz


Hope this helps,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-419  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114









Re: [XeTeX] Whitespace in input

2011-11-17 Thread Ross Moore
Hi Phil,

On 18/11/2011, at 6:56 AM, Philip TAYLOR wrote:

 Ross, I do not dispute your arguments : I was answering
 Keith's question in an honest way.  I (personally) do not
 think of a space in TeX output as a character at all,
 because I am steeped in TeX philosophy; but I am quite
 willing to accept that /if/ the objective is not to
 produce output for the sake of output, but output for
 subsequent processing as input by another program, then
 there /may/ be an argument for outputting a space as a
 variable-width glyph.
 
 However, I do think that what appears in the output stream
 is a secondary consideration; far more important (IMHO) is
 how we represent that space /within XeTeX/.  

Do you realise how XeTeX works?
Especially when handling non-Latin-based languages?

Essentially it does *nothing at all* after macro expansion.

Instead it passes strings of characters (tokens are converted back
to characters) to an external process --- namely the font-handling
aspects provided by the computer's operating system, or other
software. What returns is a piece of PDF output, along with the
height/depth/width of this piece (i.e. a TeX-like box).

It is this external software that has been designed to encode the
knowledge of how the particular language script is structured.
It makes all the detailed decisions about character placement,
perhaps using information contained within the font itself.

Indeed for many fonts, there are no such decisions, since the
font actually does it itself. All that is needed is to place
the character string in the most appropriate position on the page.

XeTeX does play a role in determining whether the box fits on the
line being built. If not, then hyphenation points come into play,
so that alternative break-ups of the character string into smaller
pieces must be considered.


Why am I giving this detail of a description? ...


 There is, I am
 sure, not a suggestion on the table that we start to treat
 a conventional space in XeTeX other than as TeX has traditionally
 treated it, and therefore the real question is (to my mind),
 do we adopt an extension of this traditional TeX treatment
 for non-breaking space, thin-space, and any of the other
 not-quite-standard spaces that Unicode encompasses,

 ... 
Well what if those not-quite-standard space characters
actually play a vital role in the layout of a language script?

Indeed some of them do. For instance, other threads on this
XeTeX list are talking about ZWJ and ZWNJ, and I've already
mentioned things like the LTR and RTL indicators.

Almost certainly many of the other characters are handled
specially already by the OS software that XeTeX passes the
main decisions to. So changing this at input level for XeTeX
could completely change the visual appearance of the output,
in ways that TeX software has no way to fix.

In other terms, those extra space characters are programming
instructions for other non-TeX-based software. XeTeX needs to 
pass them on unchanged, if that software is to give back to
XeTeX the high-quality typeset output building blocks that 
it needs to position on the page.
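
The distinction between true spaces and format characters can be checked against the Unicode character database. The following is a small sketch using Python's standard `unicodedata` module; the selection of sample code points is an assumption about which characters this thread has in mind.

```python
import unicodedata

# Space, no-break space, thin space, ZWNJ, ZWJ, LTR mark, RTL mark.
samples = ["\u0020", "\u00a0", "\u2009", "\u200c", "\u200d", "\u200e", "\u200f"]


def describe(ch):
    """Return (code point, general category, name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


for ch in samples:
    print(*describe(ch))
```

The first three are reported with general category `Zs` (space separator), while the joiners and directional marks are `Cf` (format). That fits the description of the latter as instructions to the shaping software rather than as spaces.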


By accepting Unicode input, and passing it along to other
software, TeX has inherited the ability to handle many, many
more languages and scripts than it ever could do properly before.
This is as well as making a much richer set of fonts available
for use in XeTeX-produced PDFs.
 
It does these things by piggy-backing on the work of others, 
developed by people who might have absolutely no idea of what TeX 
is, nor how it works, and probably would not care even if they did.
It is a win-win all round --- something that is very rare these days. 


But this does come with a price.
It means that XeTeX-produced output can be OS dependent, 
unlike with other TeX software!

Also, successful compilation to the desired output can be
dependent on having the correct version of a font installed.
Many posts on the XeTeX list have been about such issues.


 or do
 we look for an alternative model which /might/ be glyph-
 or character-based ?.

My view is: no, we should not, at least not as
the default way that XeTeX handles its input.


By all means write packages that can be used in particular 
situations where such characters are producing observable
unwanted effects on the final output.
But this should be done at the package level 
(e.g. by a \catcode change, and macro definition).

Then the source document will have a line in the preamble
that indicates that there could be a deviation from default
behaviours. This is an indication that there is something
special about the source stream, and someone with appropriate
knowledge has worked out how to deal with it.

But for general (default) usage, the non-ASCII characters
representing Unicode code-points that go in should be treated
as exactly those Unicode code-points. 


Alternatively, use the editor to change the unwanted characters 
to ordinary spaces, or whatever else works well with TeX processing.


Re: [XeTeX] Whitespace in input

2011-11-17 Thread Zdenek Wagner
2011/11/17 Ross Moore ross.mo...@mq.edu.au:
 Hello Zdenek,

 On 18/11/2011, at 7:49 AM, Zdenek Wagner wrote:

 But a formatting instruction for one program cannot serve as reliable input
 for another.
 A heuristic is then needed, to attempt to infer that a programming
 instruction must have been used, and guess what kind of instruction it might
 have been. This is not 100% reliable, so is deprecated in modern methods of
 data storage and document formats.
 XML based formats use tagging, rather than programming instructions. This is
 the modern way, which is used extensively for communicating data between
 different software systems.

 Yes, that's the point. The goal of TeX is nice typographical
 appearance. The goal of XML is easy data exchange. If I want to send
 structured data, I send XML, not PDF.

 These days people want both.


 ** Phil.

 TeX's strength is in its superior ability to position characters on the page
 for maximum visual effect. This is done by producing detailed programming
 instructions within the content stream of the PDF output. However, this is
 not enough to meet the needs of formats such as EPUB, non-visual reading
 software, archival formats, searchability, and other needs.
 Tagged PDF can be viewed as Adobe's response to address these requirements
 as an extension of the visual aspects of the PDF format. It is a direction
 in which TeX can (and surely must) move, to stay relevant within the
 publishing industry of the future.

 Hope this helps,
     Ross

 No, it does not help. Remember that the last (almost) portable version
 of PDF is 1.2. If you are to open tagged PDF or even PDF with a
 toUnicode map or a colorspace other than RGB or CMYK in Acrobat Reader
 3, it displays a fatal error and dies. I reported it to Adobe in March
 2001 and they did nothing.

 What else would you expect?
 AR is at version 10 now.
 On Linux it is at version 9 now, indeed 9.4.6 is current.

For OS/2 (now eComStation) the latest AR is at version 3 with known
bugs not fixed.

 You don't expect TeX formats prior to TeX3 to handle non-ascii
 characters, so why would you expect other people's older software
 versions to handle documents written for later formats?

 I even reported another fatal bug in
 January 2001. I sent sample files but nothing happened, Adobe just
 stopped development of Acrobat Reader at buggy version 3 for some
 operating systems.

 Why should they support OSs that have a limited life-time?
 Industry moves on. A new computer is very cheap these days,
 with software that can do things your older one never could do.

Yes, since that time OS/2 evolved from version 2 through 3, Warp
Connect, 4, 4.5, eComStation 1.0, eComStation 1.1 to eComStation 2.0,
yet AR remained at version 3.

 By all means keep the old one while it still does useful work,
 but you get another to do things that the older cannot handle.

If I compare multitasking of OS/2 on my old Celeron 333 MHz with Linux
running on a quad-core Intel 4.3 GHz, the winner is still OS/2. If I
have a single thread in mind, 4.3 GHz is of course faster, but
multitasking and multithreading are handled much better in OS/2. A few
years ago I made a comparison with a long numerical calculation on
OS/2 (Celeron 333 MHz) and Windows XP (Intel 250 MHz). The program
took 16 hours on OS/2 running Apache server at the same time and 240
hours on Windows running only this program. I am not sure that I could find
the very same program now, but judging from similar programs I would
expect 6 hours on a quad-core 4.3 GHz with Linux. Are you surprised that
I am not satisfied with progress in HW and OS?

 Why do you so much rely on Adobe? When exchanging
 structured documents I will always do it in XML and never create
 tagged PDF because ...

 PDF, as a published standard, is not maintained by Adobe itself
 these days, yet Adobe continues to provide a free reader, at least
 for the visual aspects. That makes documents in PDF viewable by
 everyone (who is only interested in the visual aspect).

 It is an ISO standard, which publishers will want to use.
 Most of the people who use (La)TeX are academics or others
 who need to do a fair amount of publishing, of one kind
 or another.

 TeX can be modified to become capable of producing Tagged PDF.
     (See the videos of my talks.)
 Free software (Poppler) is being developed to handle most aspects
 of PDF content, though it hasn't yet progressed enough to support
 structure tagging. It's surely on the list of things to do.

Yes, it is good for extraction even on OS/2 (I do not know whether
people compiled poppler, but xpdf binaries are available).

  ... I know that some users will be unable to read them
 by Adobe Acrobat Reader.

 Why not?
 It is not Adobe Reader that is holding them back.

 I do not wish to make them dependent on
 ghostscript and similar tools.

 You'll have to give some more details of who you are
 referring to here, and why their economic circumstances
 require them to have access to 

Re: [XeTeX] Whitespace in input

2011-11-17 Thread Keith J. Schultz
Hi Philip,

We are basically following the same lines.

TeX is foremost a layout program based on standard printers'
methodology, where the space character is white space and not a glyph.

We do actually have to differentiate between the two in discussions.

The crux of the problem is in (Xe)TeX's parsing algorithm. I never liked it
and personally I have many problems with it.

regards
Keith.

On 17.11.2011 at 13:53, Philip TAYLOR wrote:

 
 
 Keith J. Schultz wrote:
 
 On 17.11.2011 at 11:26, Keith J. Schultz wrote:
 
 O.K.
 
 You mention in a later post that you do consider a space as a printable 
 character.
  This line should read as:
  You mention in a later post that you consider a space as a 
 non-printable character.
 
 No, I don't think of it as a character at all, when we are talking
 about typeset output (as opposed to ASCII (or Unicode) input).  Clearly
 it is a character on input, but unless it generates a glyph in the
 output stream (which TeX does not, for normal spaces) then it is not
 a character (/qua/ character) on output but rather a formatting
 instruction not dissimilar to (say) end-of-line.






Re: [XeTeX] Whitespace in input

2011-11-17 Thread Philip TAYLOR



Keith J. Schultz wrote:


The crux of the problem is in (Xe)TeX's parsing algorithm. I never liked it
and personally I have many problems with it.


Is this XeTeX-specific, Keith, or do you also dislike
TeX's parsing algorithm ?  And what is it that you
dislike, and how would you propose that it be improved ?

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Keith J. Schultz
Hi Tobias,

On 14.11.2011 at 18:42, Tobias Schoel wrote:

 
 
 On 14.11.2011 at 18:30, msk...@ansuz.sooke.bc.ca wrote:
[snip, snip]
 Now we come to the trouble of Unicode specifying a line-breaking algorithm ( 
 http://www.unicode.org/reports/tr14/tr14-26.html ), which probably isn't 
 exactly TeX's. I'm not into these algorithms, so I can't compare. But I would 
 ask some Master of this Art to speak up about this conflict.
I went and briefly looked at the annex. In the beginning it states that
the annexes are not necessarily a requirement unless mentioned in the standard!
I did not check the standard, but as you read on, it becomes clear that the
line breaking algorithm (LBA) it describes is not mandatory at all.
Furthermore, it more or less describes which characters are directly
involved with line breaking (top of table 1).
The rest is just a suggestion of how one might go about achieving line
breaking. This is not a standard at all.

Since TeX has its own line breaking algorithms we need not
worry about the content of this annex as far as Unicode is concerned.
What you should be aware of is that the LBA is intended as an aid for
a preprocessor to a more elaborate line breaking algorithm.
It has been approved for printing, but nowhere does it state that it
must be followed nor that it is complete.
In other words it is merely a suggestion.

There is no conflict per se. Just another way of dealing with line 
breaking. There is no real standard for line breaking.
It is more or less a matter of taste, style and aesthetics. (Yes, there 
are many conventions that should be observed,
and many are grammatical in nature).

regards
Keith.







Re: [XeTeX] Whitespace in input

2011-11-15 Thread Chris Travers
On Tue, Nov 15, 2011 at 2:27 AM, Keith J. Schultz keithjschu...@web.de wrote:
 Hi all,

 I agree that XeTeX should support all printable characters.

Given your definition I would say all visible printed characters.
Invisible characters are a problem in a programming language.

 A non.breaking space is to me a printable character, in so far that
 it is important and must be used to distinguish between word space, et all.

As long as this is an option which defaults to off, again I have no
problem with this.   I mean by this definition, carriage returns and
line feeds are also printable characters, and these are supported by
options which must be turned on rather than being on by default.

 To go back in history, one of my pet peeves in LaTeX was that I had to
 enter the German characters öäüß as \"o, \"a, etc. and later the
 shortcut forms "s, "u, etc. Later, with inputenc, I could finally just enter
 öäüß. But I had trouble with (actually, just needed to convert) my files to and from
 Apple to Windows (so that editing was possible on Windows).

 Yet, I still had trouble with quoting, so I was forced to use \quote, et al.
 to have a simple method of quoting properly in English, German and French
 in one document! I even modified them to suit some requirements I had, and
 I had one command.

 Unicode has thankfully changed all this. I can forget about using all those TeX
 commands for the characters I need. I just type away.

 The only problem now is the keyboard equivalents and how the editor of
 choice
 displays them.

But here you have a problem.  An editor can display a non-breaking
space according to its semantic value (i.e. with a special glyph), but this is
not without problems.  For example, we could also display line feeds
as the paragraph symbol, but that is also U+00B6, so now you have
ambiguity issues: is it a Unicode character or is it a line feed?
Or you can color code, but this is problematic for a large number of
other reasons.

So I am not sure these are simple problems that admit of simple solutions.

My recommendation is:

1)  Default to handling all white space as it exists now.
2)  Provide some sort of switch, whether to the execution of XeTeX or
to the document itself, to turn on handling of special unicode
characters.
3)  If that switch is enabled, then treat the whitespaces according to
unicode meanings.  If not, treat them as standard whitespace.
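
The three points above can be sketched outside TeX as a small preprocessing step. This is only an illustration in Python of the proposed behaviour, not a description of how XeTeX reads its input; the function name `normalize_spaces` and the `unicode_aware` flag are hypothetical.

```python
import unicodedata


def normalize_spaces(text, unicode_aware=False):
    """Model of the proposed switch.

    Switch off (the default): every space separator (general category
    Zs) is replaced by a plain U+0020, i.e. "standard whitespace".
    Switch on: the text is returned unchanged, so that a later stage
    can act on each code point according to its Unicode meaning.
    """
    if unicode_aware:
        return text
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in text
    )


sample = "Keith\u00a0J.\u2009Schultz"
print(normalize_spaces(sample) == "Keith J. Schultz")
print(normalize_spaces(sample, unicode_aware=True) == sample)
```

Comparing the output of the two modes for a given file would also be one way of finding out whether a layout problem is caused by non-breaking spaces in the input.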

The advantage of this approach is that people who don't want to worry
about what sort of whitespace is in text files they are inputting
don't have to worry about it, and that those who do have an easy way
of determining if a layout issue is caused by non-breaking spaces.

Best Wishes,
Chris Travers





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip and Le Khanh



Keith J. Schultz wrote:


A non.breaking space is to me a printable character, in so far that
it is important and must be used to distinguish between word space, et all.


If, for you, [a] non.breaking space is a printable character, then
presumably that character must be taken from some font.  If you take
a character from a font, it will have a size, and although it can be
combined with kerning rules to adjust its position w.r.t. adjacent
characters,  the logic for this is fairly restricted.  In particular,
it cannot take into account the amount by which TeX is seeking to
expand or contract spaces on the current line in order to achieve
optimal paragraphs.  So in your model of the ideal universe,
non-breaking Unicode spaces would not behave as do conventional
TeX non-breaking spaces (which /do/ expand and contract to assist
in TeX's line-breaking), nor would they conform to their Unicode
definition where their decomposition is defined as :

noBreak SPACE (U+0020)

I wonder if you would like to discuss these points ?

Philip Taylor
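
The decomposition quoted above can be read directly from the Unicode character database. This is a short sketch using Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

nbsp = "\u00a0"

# The decomposition field holds a compatibility tag followed by the target code point.
decomp = unicodedata.decomposition(nbsp)
tag, target = decomp.split()
print(unicodedata.name(nbsp), "->", tag, "U+" + target)

# Compatibility normalization (NFKC) applies that mapping and so
# discards the no-break property altogether.
print(unicodedata.normalize("NFKC", "a" + nbsp + "b") == "a b")
```

In other words, as far as the character database is concerned, a no-break space is an ordinary space with a line-breaking restriction attached. Nothing in it says that the space has a fixed width.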




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Mike Maxwell

On 11/15/2011 5:39 AM, Chris Travers wrote:

My recommendation is:

1)  Default to handling all white space as it exists now.
2)  Provide some sort of switch, whether to the execution of XeTeX or
to the document itself, to turn on handling of special unicode
characters.
3)  If that switch is enabled, then treat the whitespaces according to
unicode meanings.  If not, treat them as standard whitespace.


I think you asked me earlier whether that would satisfy me, and I failed 
to answer. Yes, it would.

--
Mike Maxwell
maxw...@umiacs.umd.edu
My definition of an interesting universe is
one that has the capacity to study itself.
--Stephen Eastmond




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/15 Mike Maxwell maxw...@umiacs.umd.edu:
 On 11/15/2011 5:39 AM, Chris Travers wrote:

 My recommendation is:

 1)  Default to handling all white space as it exists now.
 2)  Provide some sort of switch, whether to the execution of XeTeX or
 to the document itself, to turn on handling of special unicode
 characters.
 3)  If that switch is enabled, then treat the whitespaces according to
 unicode meanings.  If not, treat them as standard whitespace.

 I think you asked me earlier whether that would satisfy me, and I failed to
 answer. Yes, it would.

But such a solution is not clean: you cannot plug such logic into
TeX's mouth when the input is being read, nor into the output stage when
TECkit maps are in effect. I wrote the reasons earlier. The only
reasonable solution seems to be the one suggested by Phil Taylor, to
extend \catcode up to 255 and assign special categories to other types
of characters. Thus we could say that normal space is 10, nonbreakable
space is 16, thin space is 17 etc. XeTeX will then be able to treat
them properly.





-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Chris Travers
2011/11/15 Zdenek Wagner zdenek.wag...@gmail.com:
 2011/11/15 Mike Maxwell maxw...@umiacs.umd.edu:
 On 11/15/2011 5:39 AM, Chris Travers wrote:

 My recommendation is:

 1)  Default to handling all white space as it exists now.
 2)  Provide some sort of switch, whether to the execution of XeTeX or
 to the document itself, to turn on handling of special unicode
 characters.
 3)  If that switch is enabled, then treat the whitespaces according to
 unicode meanings.  If not, treat them as standard whitespace.

 I think you asked me earlier whether that would satisfy me, and I failed to
 answer. Yes, it would.

 But such a solution is not clean, you cannot plug in such logic to the
 TeX mouth when the input is being read nor to the output stage when
 TECkit maps are in effect. I wrote the reasons earlier. The only
 reasonable solution seems to be the one suggested by Phil Taylor, to
 extend \catcode up to 255 and assign special categories to other types
 of characters. Thus we could say that normal space is 10, nonbreakable
 space is 16, thin space is 17 etc. XeTeX will then be able to treat
 them properly.

But we are talking two different things here.  The first is user
interface, and the second is mechanism.

What I am saying is special handling of this sort should be required
to be enabled somehow by the user.  I don't really care how.  It could
be by a commandline switch to xelatex.  It could be by a call in the
document if that's possible.  It should be optional, and disabled by
default, given that the characters involved are not intended to be
displayed with glyphs.

Best Wishes,
Chris Travers





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip TAYLOR



Zdenek Wagner wrote:


The only reasonable solution seems to be the one suggested by Phil Taylor, to
extend \catcode up to 255 and assign special categories to other types
of characters. Thus we could say that normal space is 10, nonbreakable
space is 16, thin space is 17 etc. XeTeX will then be able to treat
them properly.


which may, unfortunately, then require new types of node
in TeX's internal list structures ...

(may, not will).

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip TAYLOR



Chris Travers wrote:


But we are talking two different things here.  The first is user
interface, and the second is mechanism.

What I am saying is special handling of this sort should be required
to be enabled somehow by the user.  I don't really care how.  It could
be by a commandline switch to xelatex.  It could be by a call in the
document if that's possible.  It should be optional, and disabled by
default, given that the characters involved are not intended to be
displayed with glyphs.


But /if/ it requires a change to the number of category codes
(and/or the creation of one or more classes of internal node),
then this is not something that should be capable of being
turned on or off within a document.  I don't have any problem
with the idea of turning the functionality on or off either
within a format file or from a command-line qualifier.

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/15 Chris Travers chris.trav...@gmail.com:
 2011/11/15 Zdenek Wagner zdenek.wag...@gmail.com:
 2011/11/15 Mike Maxwell maxw...@umiacs.umd.edu:
 On 11/15/2011 5:39 AM, Chris Travers wrote:

 My recommendation is:

 1)  Default to handling all white space as it exists now.
 2)  Provide some sort of switch, whether to the execution of XeTeX or
 to the document itself, to turn on handling of special unicode
 characters.
 3)  If that switch is enabled, then treat the whitespaces according to
 unicode meanings.  If not, treat them as standard whitespace.

 I think you asked me earlier whether that would satisfy me, and I failed to
 answer. Yes, it would.

 But such a solution is not clean, you cannot plug in such logic to the
 TeX mouth when the input is being read nor to the output stage when
 TECkit maps are in effect. I wrote the reasons earlier. The only
 reasonable solution seems to be the one suggested by Phil Taylor, to
 extend \catcode up to 255 and assign special categories to other types
 of characters. Thus we could say that normal space is 10, nonbreakable
 space is 16, thin space is 17 etc. XeTeX will then be able to treat
 them properly.

 But we are talking two different things here.  The first is user
 interface, and the second is mechanism.

 What I am saying is special handling of this sort should be required
 to be enabled somehow by the user.  I don't really care how.  It could
 be by a commandline switch to xelatex.  It could be by a call in the
 document if that's possible.  It should be optional, and disabled by
 default, given that the characters involved are not intended to be
 displayed with glyphs.

The mechanism is simple: set the \catcode of this character to 13 (active)
and define it as \nobreak\space. If you wish to make it clever in all
XeLaTeX corners, find one of my previous posts to see what has to be taken
into account. It may be present in a package called nbsp.sty or so. No
change in XeTeX is needed if you do it this way.
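
A minimal sketch of that mechanism for plain XeTeX (run with xetex) might
look as follows. It is an illustration only, not the code of any existing
package. The ^^a0 notation stands for U+00A0, so the example does not depend
on an invisible character surviving copy and paste, and the \leavevmode is
an addition of this sketch, not part of the suggestion above.

```tex
% Make U+00A0 (NO-BREAK SPACE) an active character (category 13).
\catcode"00A0=\active
% Define it roughly as suggested above. \leavevmode is an assumption of
% this sketch: it keeps the penalty out of the vertical list when the
% character happens to start a paragraph.
\def^^a0{\leavevmode\nobreak\space}

% ^^a0 below is read as the active character defined above.
Figure^^a01 should not be split across lines, just like Figure~1.
\bye
```

Because the space inserted this way is ordinary glue, it stretches and
shrinks with the other spaces on the line.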





-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/15 Philip TAYLOR p.tay...@rhul.ac.uk:


 Zdenek Wagner wrote:

 The only  reasonable solution seems to be the one suggested by Phil
 Taylor, to
 extend \catcode up to 255 and assign special categories to other types
 of characters. Thus we could say that normal space is 10, nonbreakable
 space is 16, thin space is 17 etc. XeTeX will then be able to treat
 them properly.

 which may, unfortunately, then require new types of node
 in TeX's internal list structures ...

 (may, not will).

Sure, the change will not be trivial. I do not know how the category
codes are stored internally, but extending them from 16 possible values
to 256 may require a dramatic change in the internal structures.

 ** Phil.




-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/15 Philip TAYLOR p.tay...@rhul.ac.uk:


 Chris Travers wrote:

 But we are talking two different things here.  The first is user
 interface, and the second is mechanism.

 What I am saying is special handling of this sort should be required
 to be enabled somehow by the user.  I don't really care how.  It could
 be by a commandline switch to xelatex.  It could be by a call in the
 document if that's possible.  It should be optional, and disabled by
 default, given that the characters involved are not intended to be
 displayed with glyphs.

 But /if/ it requires a change to the number of category codes
 (and/or the creation of one or more classes of internal node),
 then this is not something that should be capable of being
 turned on or off within a document.  I don't have any problem
 with the idea of turning the functionality on or off either
 within a format file or from a command-line qualifier.

If you know what such characters are (and it will certainly be
documented), you just set their categories back to 12 in order to get
the old behaviour.





-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip TAYLOR



Zdenek Wagner wrote:


If you know what such characters are (and it will certainly be
documented), you just set their categories back to 12 in order to get
the old behaviour.


No ! A catcode is for life, not just for Christmas !  Once a
character has been read, and bound into a character/catcode pair,
that catcode remains immutable.  That means that code that is /not/
expecting to have to deal with non-standard catcodes could none the
less be passed token lists containing such entities if it is
possible, within a document, to turn such a feature on and
off again.

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Arthur Reutenauer
On Tue, Nov 15, 2011 at 02:20:17PM +, Philip TAYLOR wrote:
 No ! A catcode is for life, not just for Christmas !  Once a
 character has been read, and bound into a character/catcode pair,
 that catcode remains immutable.

  Do you mean that as a general good practice in TeX programming, or as
a description of how TeX works?  The latter is obviously wrong.

Arthur




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/15 Philip TAYLOR p.tay...@rhul.ac.uk:


 Zdenek Wagner wrote:

 If you know what such characters are (and it will certainly be
 documented), you just set their categories back to 12 in order to get
 the old behaviour.

 No ! A catcode is for life, not just for Christmas !  Once a
 character has been read, and bound into a character/catcode pair,
 that catcode remains immutable.  That means that code that is /not/
 expecting to have to deal with non-standard catcodes could none the
 less be passed token lists containing such entities if it is
 possible, within a document, to turn such a feature on and
 off again.

Of course, I know it. What I meant was that you could set \catcode of
all these extended characters to 12 at the beginning of your
document. Thus you get the same behaviour as now.
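
As a sketch of what such a document preamble could contain, assuming XeTeX
(which accepts character codes above 255) and assuming an arbitrary
selection of code points; the extended categories 16, 17, ... remain a
proposal and are not used here:

```tex
% Hypothetical preamble lines: give some Unicode spaces category 12
% ("other"), so that they are treated like any ordinary character.
% Which code points to list is an assumption of this sketch.
\catcode"00A0=12 % NO-BREAK SPACE
\catcode"2009=12 % THIN SPACE
\catcode"202F=12 % NARROW NO-BREAK SPACE
```

These assignments affect only characters read after them, which is why they
belong at the beginning of the document.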





-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip TAYLOR



Arthur Reutenauer wrote:

On Tue, Nov 15, 2011 at 02:20:17PM +, Philip TAYLOR wrote:

No ! A catcode is for life, not just for Christmas !  Once a
character has been read, and bound into a character/catcode pair,
that catcode remains immutable.


   Do you mean that as a general good practice in TeX programming, or as
a description of how TeX works?  The latter is obviously wrong.


The latter is what the TeXbook says (p. 39) : Once a category code
has been attached to a character token, the attachment is permanent.

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Herbert Schulz

On Nov 15, 2011, at 8:52 AM, Philip TAYLOR wrote:

 
 
 Arthur Reutenauer wrote:
 On Tue, Nov 15, 2011 at 02:20:17PM +, Philip TAYLOR wrote:
 No ! A catcode is for life, not just for Christmas !  Once a
 character has been read, and bound into a character/catcode pair,
 that catcode remains immutable.
 
   Do you mean that as a general good practice in TeX programming, or as
 a description of how TeX works?  The latter is obviously wrong.
 
 The latter is what the TeXbook says (p. 39) : Once a category code
 has been attached to a character token, the attachment is permanent.
 
 ** Phil.


Howdy,

What happens in a verbatim environment?

Good Luck,

Herb Schulz
(herbs at wideopenwest dot com)







Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip TAYLOR



Zdenek Wagner wrote:


Of course, I know it. What I meant was that you could set \catcode of
all these extended characters to 12 at the beginning of your
document. Thus you get the same behaviour as now.


Ah yes : with that, I have no problem.
** Phil.




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/15 Herbert Schulz he...@wideopenwest.com:

 On Nov 15, 2011, at 8:52 AM, Philip TAYLOR wrote:



 Arthur Reutenauer wrote:
 On Tue, Nov 15, 2011 at 02:20:17PM +, Philip TAYLOR wrote:
 No ! A catcode is for life, not just for Christmas !  Once a
 character has been read, and bound into a character/catcode pair,
 that catcode remains immutable.

   Do you mean that as a general good practice in TeX programming, or as
 a description of how TeX works?  The latter is obviously wrong.

 The latter is what the TeXbook says (p. 39) : Once a category code
 has been attached to a character token, the attachment is permanent.

 ** Phil.


 Howdy,

 What happens in a verbatim environment?

It will have to be redefined; there will just be additional special
characters that will have to be handled. \XeTeXrevision will tell you
whether the extended \catcode is implemented.
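
A sketch of how a macro file can at least detect XeTeX and report its
version is given below. Which revision, if any, would carry an extended
\catcode is not known, so no threshold is tested here.

```tex
\ifx\XeTeXrevision\undefined
  \message{Not running under XeTeX.}
\else
  % \XeTeXversion is an integer, and \XeTeXrevision expands to a string
  % that starts with a dot, so together they give the full version.
  \message{XeTeX \the\XeTeXversion\XeTeXrevision}
\fi
\bye
```

A package redefining verbatim could branch on such a test once a revision
number for the new behaviour is documented.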





-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Arthur Reutenauer
 The latter is what the TeXbook says (p. 39) : Once a category code
 has been attached to a character token, the attachment is permanent.

  Yes, because you meant individual tokens (which I understood in
retrospect).  But in the context of the discussion, you really seemed to
be saying that you could not change the \catcodes of characters to be
read, which was the point (not that there is much point left to the
whole thread any more...)

Arthur




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip TAYLOR



Arthur Reutenauer wrote:

The latter is what the TeXbook says (p. 39) : Once a category code
has been attached to a character token, the attachment is permanent.


   Yes, because you meant individual tokens (which I understood in
retrospect).  But in the context of the discussion, you really seemed to
be saying that you could not change the \catcodes of characters to be
read, which was the point (not that there is much point left to the
whole thread any more...)


No no : changing catcodes on the fly is standard TeX programming;
what we should not contemplate is changing the /number/ of catcodes
on the fly ...

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip TAYLOR



Herbert Schulz wrote:


The latter is what the TeXbook says (p. 39) : Once a category code
has been attached to a character token, the attachment is permanent.

** Phil.



What happens in a verbatim environment?


The verbatim environment sets up an environment within
which characters that have not yet been seen by TeX's
mouth receive category codes that potentially differ
from the category code that would normally be associated
with that character.  Once the category code has been
bound to a particular instance of that character, that
instance never changes its catcode.
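
The binding can be seen in a small plain TeX sketch; the macro name \saved
and the choice of the character ! are arbitrary:

```tex
\def\saved{!}          % this ! becomes a token now, with catcode 12 (other)
\catcode`\!=\active    % affects only characters read from here on
\def!{(active)}
\saved                 % typesets "!": the stored token keeps catcode 12
!                      % typesets "(active)": read after the change
\bye
```

The token stored inside \saved was formed before the \catcode assignment,
so it is unaffected by it, while the character read afterwards is active.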

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Herbert Schulz

On Nov 15, 2011, at 11:19 AM, Philip TAYLOR wrote:

 
 
 Herbert Schulz wrote:
 
 The latter is what the TeXbook says (p. 39) : Once a category code
 has been attached to a character token, the attachment is permanent.
 
 ** Phil.
 
 What happens in a verbatim environment?
 
 The verbatim environment sets up an environment within
 which characters that have not yet been seen by TeX's
 mouth receive category codes that potentially differ
 from the category code that would normally be associated
 with that character.  Once the category code has been
 bound to a particular instance of that character, that
 instance never changes its catcode.
 
 ** Phil.


Howdy,

So what you are saying is not that you can't control the catcode of a 
particular character but that you can't change it after it is set and in TeX's 
``stomach.'' That I can agree with.

Good Luck,

Herb Schulz
(herbs at wideopenwest dot com)








Re: [XeTeX] Whitespace in input

2011-11-15 Thread Herbert Schulz

On Nov 15, 2011, at 11:11 AM, Herbert Schulz wrote:

 
 On Nov 15, 2011, at 11:19 AM, Philip TAYLOR wrote:
 
 
 
 Herbert Schulz wrote:
 
 The latter is what the TeXbook says (p. 39) : Once a category code
 has been attached to a character token, the attachment is permanent.
 
 ** Phil.
 
 What happens in a verbatim environment?
 
 The verbatim environment sets up an environment within
 which characters that have not yet been seen by TeX's
 mouth receive category codes that potentially differ
 from the category code that would normally be associated
 with that character.  Once the category code has been
 bound to a particular instance of that character, that
 instance never changes its catcode.
 
 ** Phil.
 
 
 Howdy,
 
 So what you are saying is not that you can't control the catcode of a 
 particular character but that you can't change it after it is set and in 
 TeX's ``stomach.'' That I can agree with.
 
 Good Luck,
 
 Herb Schulz
 (herbs at wideopenwest dot com)


Howdy,

What I meant to say was...

So what you are saying is not that you can control the catcode of a particular 
character but that you can't change it after it is set and in TeX's 
``stomach.'' That I can agree with.

(notice the can't control --- can control)

Good Luck,

Herb Schulz
(herbs at wideopenwest dot com)








Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip TAYLOR

I think it made more sense with "can't", Herb,
but that could be a trans-Atlantic difference
of usage -- you would, I think, say "I could care
less" where I would say "I couldn't care less".

** Phil.

Herbert Schulz wrote:


What I meant to say was...

So what you are saying is not that you can control the catcode of a particular 
character but that you can't change it after it is set and in TeX's 
``stomach.'' That I can agree with.

(notice the can't control ---  can control)





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Herbert Schulz

On Nov 15, 2011, at 2:43 PM, Ross Moore wrote:

 
 On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
 
 Given that TeX (and XeTeX too) deal with a non-breakable space already (where 
 we usually use the ~ to represent that space) it seems to me that XeTeX 
 should treat that the same way.
 
 No, I disagree completely.
 
 What if you really want the Ux00A0 character to be in the PDF?
 That is, when you copy/paste from the PDF, you want that character
 to come along for the ride.
 
 In TeX ~ *simulates* a non-breaking space visually, but there is
 no actual character inserted.
 If you want the character you have to ensure that it gets there,
 and what more natural way is there than to put it in explicitly.
 
 This is how XeTeX treats it currently, according to my experiments,
 using just  fontspec  and  Charis SIL font.
 Anyone who has a different experience should check what other
 packages and fonts are being loaded, and whether there is something
 that specifically changes how that character is handled.
 

Howdy,

But isn't that also true about a regular space character? Doesn't (Xe)TeX 
insert some glue rather than a Space Character?
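
One way to check this is to let TeX display the contents of a box. The
sketch below assumes the plain format with its default font; the box number
and the sample text are arbitrary.

```tex
% Show the box contents on the terminal as well as in the log.
\tracingonline=1 \showboxbreadth=100 \showboxdepth=100
\setbox0=\hbox{a b~c}
% The listing should contain glue items between the letters, plus a
% penalty where the tie was, rather than any space character.
\showbox0
\bye
```

TeX pauses after \showbox as it does after an error; pressing return
continues the run.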

 The big puzzle will happen when someone, not using an editor capable of 
 displaying invisibles, can't understand why they can't get XeTeX to break 
 between the two words.
 
 That is an editor problem, not one that XeTeX itself should be
 concerned with.
 

Agreed. But I'll bet you end up with lots of questions on ctt/texhax/etc. about 
line breaking; assuming that the non-breaking space actually does its ``job.''

 
 Now having Ux00A0 between two words may change the way 
 hyphenation works for those words.
 
 But surely if you are wanting to inhibit a line-break
 between words, you probably also don't want either word to
 be hyphenated. So this could really be the correct thing.
 

or not. :-)

Good Luck,

Herb Schulz
(herbs at wideopenwest dot com)








Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/15 Ross Moore ross.mo...@mq.edu.au:

 On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:

 Given that TeX (and XeTeX too) deal with a non-breakable space already (where 
 we usually use the ~ to represent that space) it seems to me that XeTeX 
 should treat that the same way.

 No, I disagree completely.

 What if you really want the Ux00A0 character to be in the PDF?
 That is, when you copy/paste from the PDF, you want that character
 to come along for the ride.

From the typographical point of view it is the worst of all possible
methods. If you really wish it, then do not use TeX but M$ Word or
OpenOffice. M$ Word automatically inserts nonbreakable spaces at some
points in the text written in Czech. As far as grammar is concerned,
it is correct. However, U+00a0 is fixed width. If you look at the
output, the nonbreakable spaces are too wide on some lines and too
thin on other lines. I cannot imagine anything uglier.


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Ross Moore
Hi Zdenek,

On 16/11/2011, at 8:58 AM, Zdenek Wagner wrote:

 2011/11/15 Ross Moore ross.mo...@mq.edu.au:
 
 On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
 
 Given that TeX (and XeTeX too) deal with a non-breakable space already (where 
 we usually use the ~ to represent that space) it seems to me that XeTeX 
 should treat that the same way.
 
 No, I disagree completely.
 
 What if you really want the Ux00A0 character to be in the PDF?
 That is, when you copy/paste from the PDF, you want that character
 to come along for the ride.
 
 From the typographical point of view it is the worst of all possible
 methods. If you really wish it,

The *really wish it* is the choice of the author, not the
software.

 then do not use TeX but M$ Word or
 OpenOffice. M$ Word automatically inserts nonbreakable spaces at some
 points in the text written in Czech. As far as grammar is concerned,
 it is correct. However, U+00a0 is fixed width. If you look at the
 output, the nonbreakable spaces are too wide on some lines and too
 thin on other lines. I cannot imagine anything uglier.

I do not disagree with you that this could be ugly.
But that is not the point.

If you want superior aesthetic typesetting, with nice choices
for hyphenation, then don't use Ux00A0. Of course!


Whatever the reason for wanting to use this character, there
should be a straightforward way to do it.
Using the character itself:
 a.  is the most understandable
 b.  currently works
 c.  requires no special explanation.


 
 

Cheers,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-419  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114









Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/15 Ross Moore ross.mo...@mq.edu.au:
 Hi Zdenek,

 On 16/11/2011, at 8:58 AM, Zdenek Wagner wrote:

 2011/11/15 Ross Moore ross.mo...@mq.edu.au:

 On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:

 Given that TeX (and XeTeX too) deal with a non-breakable space already 
 (where we usually use the ~ to represent that space) it seems to me that 
 XeTeX should treat that the same way.

 No, I disagree completely.

 What if you really want the Ux00A0 character to be in the PDF?
 That is, when you copy/paste from the PDF, you want that character
 to come along for the ride.

 From the typographical point of view it is the worst of all possible
 methods. If you really wish it,

 The *really wish it* is the choice of the author, not the
 software.

 then do not use TeX but M$ Word or
 OpenOffice. M$ Word automatically inserts nonbreakable spaces at some
 points in the text written in Czech. As far as grammar is concerned,
 it is correct. However, U+00a0 is fixed width. If you look at the
 output, the nonbreakable spaces are too wide on some lines and too
 thin on other lines. I cannot imagine anything uglier.

 I do not disagree with you that this could be ugly.
 But that is not the point.

 If you want superior aesthetic typesetting, with nice choices
 for hyphenation, then don't use Ux00A0. Of course!


 Whatever the reason for wanting to use this character, there
 should be a straight-forward way to do it.
 Using the character itself is:
  a.  the most understandable
  b.  currently works
  c.  requires no special explanation.

These are reasons why people might wish it in the source files, not in PDF.

If you wish to take a [part of] PDF and include it in another PDF as
is, you can take the PDF directly without the need of grabbing the
text. If you are interested in the text that will be retypeset, you
have to verify a lot of other things. If the text contained hyphenated
words, you have to join the parts manually. You will have a lot of
other work and the time saved by U+00a0 will be negligible. There are
tools that may help you to insert nonbreakable spaces. I even have my
own special tools written in Perl to handle one class of input files
that are really plain texts and the result is (almost) correctly
marked LaTeX source.







-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Ross Moore
Hi Phil,

On 16/11/2011, at 8:45 AM, Philip TAYLOR wrote:

 Ross Moore wrote:
 
 On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
 
 Given that TeX (and XeTeX too) deal with a non-breakable space already (where 
 we usually use the ~ to represent that space) it seems to me that XeTeX 
 should treat that the same way.
 
 No, I disagree completely.
 
 What if you really want the Ux00A0 character to be in the PDF?
 That is, when you copy/paste from the PDF, you want that character
 to come along for the ride.
 
 I'm not sure I entirely go along with this argument, Ross.
 What if you really want the \ character to be in the PDF,
 or the ^ character, or the $ character, or any character
 that TeX currently treats specially ?  

TeX already provides \$ \_ \# etc. for (most of) the other special
characters it uses, but does not for ^^A0 --- but it does not
need to if you can generate it yourself on the keyboard.


 Whilst I can agree
 that there is considerable merit in extending XeTeX such
 that it treats all of these new, special characters
 specially (by creating new catcodes, new node types and so
 on), in the short term I can see no fundamental problem with
 treating U+00A0 in such a way that it behaves indistinguishably
 from the normal expansion of ~.

How do you explain to somebody the need to do something really,
really special to get a character that they can type, or copy/paste?

There is no special role for this character in other vital aspects 
of how TeX works, such as there is for $ _ # etc.


 
 In TeX ~ *simulates* a non-breaking space visually, but there is
 no actual character inserted.
 
 And I don't agree that a space is a character, non-breaking or not !

In this view you are against most of the rest of the world.

If the output is intended to be PDF, as it really has to be with 
XeTeX, then the specifications for the modern variants of PDF 
need to be consulted.

With PDF/A and PDF/UA and anything based on ISO-32000 (PDF 1.7)
there is a requirement that the included content should explicitly
provide word boundaries. Having a space character inserted is by
far the most natural way to meet this specification.
(This does not mean that having such a character in the output
need affect TeX's view of typesetting.)

Before replying to anything in the above paragraph, please
watch the video of my recent talk at TUG-2011.

  http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/

or similar from earlier years where I also talk a bit about such things.

 
 ** Phil.


Hope this helps,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-419  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114







--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/15 Ross Moore ross.mo...@mq.edu.au:
 Hi Phil,

 On 16/11/2011, at 8:45 AM, Philip TAYLOR wrote:

 Ross Moore wrote:

 On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:

 Given that TeX (and XeTeX too) deal with a non-breakable space already
 (where we usually use the ~ to represent that space) it seems to me that
 XeTeX should treat that the same way.

 No, I disagree completely.

 What if you really want the Ux00A0 character to be in the PDF?
 That is, when you copy/paste from the PDF, you want that character
 to come along for the ride.

 I'm not sure I entirely go along with this argument, Ross.
 What if you really want the \ character to be in the PDF,
 or the ^ character, or the $ character, or any character
 that TeX currently treats specially ?

 TeX already provides \$ \_ \# etc. for (most of) the other special
 characters it uses, but does not for ^^A0 --- but it does not
 need to if you can generate it yourself on the keyboard.

00a0

 Whilst I can agree
 that there is considerable merit in extending XeTeX such
 that it treats all of these new, special characters
 specially (by creating new catcodes, new node types and so
 on), in the short term I can see no fundamental problem with
 treating U+00A0 in such a way that it behaves indistinguishably
 from the normal expansion of ~.

 How do you explain to somebody the need to do something really,
 really special to get a character that they can type, or copy/paste?

 There is no special role for this character in other vital aspects
 of how TeX works, such as there is for $ _ # etc.



 In TeX ~ *simulates* a non-breaking space visually, but there is
 no actual character inserted.

 And I don't agree that a space is a character, non-breaking or not !

 In this view you are against most of the rest of the world.

TeX NEVER outputs a space as a glyph. Text extraction tools usually
interpret horizontal spaces of sufficient size as U+0020.

(The one exception to the "never" above is verbatim mode.)

 If the output is intended to be PDF, as it really has to be with
 XeTeX, then the specifications for the modern variants of PDF
 need to be consulted.

 With PDF/A and PDF/UA and anything based on ISO-32000 (PDF 1.7)
 there is a requirement that the included content should explicitly
 provide word boundaries. Having a space character inserted is by
 far the most natural way to meet this specification.

A space character is a fixed-width glyph. If you insist on it, you
will never be able to typeset justified paragraphs; you will move back
to the era of mechanical typewriters.

 (This does not mean that having such a character in the output
 need affect TeX's view of typesetting.)

 Before replying to anything in the above paragraph, please
 watch the video of my recent talk at TUG-2011.

  http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/

 or similar from earlier years where I also talk a bit about such things.


 ** Phil.


 Hope this helps,

        Ross

 




-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Karljurgen Feuerherm
I was going to make the following point earlier--maybe in light of
Phil's conclusion I should do it now.

There seems to be a tendency not to distinguish between a(n original)
character in the sense of character of a writing system, and a computer
character.

The former are visible symbols on a background medium. The latter are
an entirely different set of symbols which to some extent parallel the
former, and to some extent do not. Space, control codes, etc. don't exist
in the former, but exist in the latter because it was a convenient way
to encode certain functions one wished to apply to the encoded other
characters--the ones that correspond more or less to original writing
system characters.

These encoding sets have developed over time, and have consequently
inherited all sorts of legacy issues, not all of which need supporting.
Unicode provides tools. No one says one has to use them all.

Specifically, the purpose of XeTeX and other such engines is to allow for
the nice typographical formatting of visual representations of script
characters against some other defined background. From that point of
view, so long as it does it, once it does it, it has achieved its goal.

Transparency of all sorts of other things, providing input via PDF to
other software isn't and shouldn't be a *primary* goal.

That being said, no doubt it might be helpful to some to have this or
that control character passed along. But that's not the essence of the
exercise, and should only be done if it can be done cheaply, i.e.
without a lot of risk to the primary objective.

I guess the real question is that latter part.

K

 On Tue, Nov 15, 2011 at  4:45 PM, in message
4ec2dd63.3040...@rhul.ac.uk,
Philip TAYLOR p.tay...@rhul.ac.uk wrote:


 Ross Moore wrote:

 On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:

 Given that TeX (and XeTeX too) deal with a non-breakable space already
 (where we usually use the ~ to represent that space) it seems to me that
 XeTeX should treat that the same way.

 No, I disagree completely.

 What if you really want the Ux00A0 character to be in the PDF?
 That is, when you copy/paste from the PDF, you want that character
 to come along for the ride.

 I'm not sure I entirely go along with this argument, Ross.
 What if you really want the \ character to be in the PDF,
 or the ^ character, or the $ character, or any character
 that TeX currently treats specially ?  Whilst I can agree
 that there is considerable merit in extending XeTeX such
 that it treats all of these new, special characters
 specially (by creating new catcodes, new node types and so
 on), in the short term I can see no fundamental problem with
 treating U+00A0 in such a way that it behaves indistinguishably
 from the normal expansion of ~.

 In TeX ~ *simulates* a non-breaking space visually, but there is
 no actual character inserted.

 And I don't agree that a space is a character, non-breaking or not !

 ** Phil.







Re: [XeTeX] Whitespace in input

2011-11-15 Thread Ross Moore
Hi Phil,

On 16/11/2011, at 10:08 AM, Zdenek Wagner wrote:

 How do you explain to somebody the need to do something really,
 really special to get a character that they can type, or copy/paste?
 
 There is no special role for this character in other vital aspects
 of how TeX works, such as there is for $ _ # etc.
 
 
 
 In TeX ~ *simulates* a non-breaking space visually, but there is
 no actual character inserted.
 
 And I don't agree that a space is a character, non-breaking or not !
 
 In this view you are against most of the rest of the world.
 
 TeX NEVER outputs a space as a glyph. Text extraction tools usually
 interpret horizontal spaces of sufficient size as U+0020.

I never said that it did, nor that it was necessary to do so.

Those text extraction tools do a pretty reasonable job, but don't
always get it right. Besides, there is reliance on a heuristic,
which can be fallible, especially if there is content typeset in 
a very small font size.
And what about at line-ends? They can get that wrong too.

Such a reliance is rather against the TeX way of doing things,
don't you think?

Better is for TeX itself to apply the heuristic, since it knows
the current font size and the separation between bits of words.
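
To make the heuristic under discussion concrete, here is a minimal Python sketch of the kind of gap test a text-extraction tool might apply to one line of positioned glyphs. The `Glyph` record, the `extract_text` function and the `threshold` parameter are hypothetical names invented for this illustration; they are not taken from any real extractor, and real tools are considerably more elaborate.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # the Unicode character this glyph maps to
    x: float      # left edge on the line, in points
    width: float  # advance width, in points


def extract_text(glyphs, font_size, threshold=0.2):
    """Rebuild a string from glyphs on one line, guessing where spaces go.

    A space (U+0020) is emitted whenever the horizontal gap between two
    consecutive glyphs exceeds threshold * font_size. The threshold is an
    assumed tuning knob, not a value taken from any specification.
    """
    out = []
    prev_end = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end > threshold * font_size:
            out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)


# Three glyphs: "a" and "b" touch, then a 3pt gap before "c".
line = [Glyph("a", 0.0, 5.0), Glyph("b", 5.0, 5.0), Glyph("c", 13.0, 5.0)]
print(extract_text(line, font_size=10.0, threshold=0.2))  # ab c
print(extract_text(line, font_size=10.0, threshold=0.4))  # abc
```

The same glyph positions give "ab c" or "abc" depending only on the threshold, which is the kind of fallibility mentioned above. An extractor has to guess the threshold, whereas the typesetting engine knows where the word boundaries are.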

 (The exception to the above mentioned never is the verbatim mode.)

That isn't good enough for TeX to produce PDF/A.
Go and watch the videos that I pointed you to.


Lower down I give a run-down of how a variant of TeX handles
this problem, to very good effect.

 
 If the output is intended to be PDF, as it really has to be with
 XeTeX, then the specifications for the modern variants of PDF
 need to be consulted.
 
 With PDF/A and PDF/UA and anything based on ISO-32000 (PDF 1.7)
 there is a requirement that the included content should explicitly
 provide word boundaries. Having a space character inserted is by
 far the most natural way to meet this specification.
 
 A space character is a fixed-width glyph. If you insist in it, you
 will never be able to typeset justified paragraphs, you will move back
 to the era of mechanical typewriters.

Absolutely wrong!

I'm not insisting on it being included as the natural way to 
separate words within the PDF, though it certainly is a possible
way that is used by other software.

 (This does not mean that having such a character in the output
 need affect TeX's view of typesetting.)

Clearly you never even read this parenthetical statement ...

 
 Before replying to anything in the above paragraph, please
 watch the video of my recent talk at TUG-2011.

 ... and certainly you don't seem to have followed up on this
piece of advice, to get a better perspective of what I'm talking
about.

 
  http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/
 
 or similar from earlier years where I also talk a bit about such things.



Here is how you get *both* TeX-quality typesetting and explicit
spaces as word-boundaries inside the PDF, with no loss of quality.

What the experimental tagged-pdfTeX does is to use a font (called
dummy-space) that contains just a single character at code Ux0020,
at a size that is almost zero -- it cannot be exactly zero, else 
PDF browsers may not select it for copy/paste, or other text-extraction.

These extra spaces are inserted into the PDF content stream, *after*
TeX has determined the correct positioning for high-quality typesetting.
That is, it is *not* done by macros or widgets or suchlike, but is
done internally by the pdfTeX engine at shipout time.

The almost-zero size has no perceptible effect on the visual output.
But the existence of these extra space characters means that all
text-extraction methods work much more reliably.

There *are* extra primitives that can be used to turn this off and on
in places where such extra spaces are not wanted; e.g. in math.
And there is a primitive to insert such a space, in case it is required
manually, for whatever reason. All of these primitives are used
extensively when generating tagged PDF of mathematical expressions,
and are thus available for other usage too.


 
 
 ** Phil.

Hope this helps,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-419  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114









Re: [XeTeX] Whitespace in input

2011-11-15 Thread Zdenek Wagner
2011/11/16 Ross Moore ross.mo...@mq.edu.au:

 On 16/11/2011, at 9:45 AM, Zdenek Wagner wrote:

 2011/11/15 Ross Moore ross.mo...@mq.edu.au:

 What if you really want the Ux00A0 character to be in the PDF?
 That is, when you copy/paste from the PDF, you want that character
 to come along for the ride.

 From the typographical point of view it is the worst of all possible
 methods. If you really wish it,

 Maybe you misunderstood what I meant here.

 I'm not saying that you might want Ux00A0 for *every* place
 where there is a word-breaking space.
 Just that there may be individual instance(s) where you have
 a reason to want it.

 Just like any other Unicode character, if you want it then
 you should be able to put it in there.

You ARE able to do it. Choose a font with that glyph, set \catcode to
11 or 12 and that's it. What else do you wish to do?

 That's what XeTeX currently does (with the TeX-wise familiar
 ASCII exceptions) for any code-point supported by the
 chosen font.


 The *really wish it* is the choice of the author, not the
 software.

 then do not use TeX but M$ Word or
 OpenOffice. M$ Word automatically inserts nonbreakable spaces at some
 points in the text written in Czech. As far as grammar is concerned,
 it is correct. However, U+00a0 is fixed width. If you look at the
 output, the nonbreakable spaces are too wide on some lines and too
 thin on other lines. I cannot imagine anything uglier.

 I do not disagree with you that this could be ugly.
 But that is not the point.

 If you want superior aesthetic typesetting, with nice choices
 for hyphenation, then don't use Ux00A0. Of course!


 Whatever the reason for wanting to use this character, there
 should be a straight-forward way to do it.
 Using the character itself is:
  a.  the most understandable
  b.  currently works
  c.  requires no special explanation.

 These are reasons why people might wish it in the source files, not in PDF.

 Yes. In the source, to have the occasional such character included
 within the PDF, for whatever reason appropriate to the material
 being typeset -- whether verbatim, or not.


 If you wish to take a [part of] PDF and include it in another PDF as
 is, you can take the PDF directly without the need of grabbing the
 text. If you are interested in the text that will be retypeset, you
 have to verify a lot of other things.

 How is any of this relevant to the current discussion?

It was you who came up with the argument that you wish to have
nonbreakable spaces when copying the text from PDF.

 If the text contained hyphenated
 words, you have to join the parts manually. You will have a lot of
 other work and the time saved by U+00a0 will be negligible. There are
 tools that may help you to insert nonbreakable spaces. I have even my
 own special tools written in perl to handle one class of input files
 that are really plain texts and the result is (almost) correctly
 marked LaTeX source.

 All well and good.
 But how is that relevant to anything I said?

See above.



 --
 Zdeněk Wagner
 http://hroch486.icpf.cas.cz/wagner/
 http://icebearsoft.euweb.cz


 Cheers,

        Ross

 




-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-15 Thread Philip TAYLOR



Ross Moore wrote:

Hi Phil,

On 16/11/2011, at 10:08 AM, Zdenek Wagner wrote:


Not I, Sir : Zdeněk  !
** Phil.




Re: [XeTeX] Whitespace in input

2011-11-15 Thread Ross Moore
Hi Zdenek,

On 16/11/2011, at 11:19 AM, Zdenek Wagner wrote:

 Just like any other Unicode character, if you want it then
 you should be able to put it in there.
 
 You ARE able to do it. Choose a font with that glyph, set \catcode to
 11 or 12 and that's it. What else do you wish to do?

The *default* behaviour should stay as this.
Any other behaviour needs to change the catcode
and make perhaps a definition.

 These are reasons why people might wish it in the source files, not in PDF.
 
 Yes. In the source, to have the occasional such character included
 within the PDF, for whatever reason appropriate to the material
 being typeset -- whether verbatim, or not.


 If you wish to take a [part of] PDF and include it in another PDF as
 is, you can take the PDF directly without the need of grabbing the
 text. If you are interested in the text that will be retypeset, you
 have to verify a lot of other things.
 
 How is any of this relevant to the current discussion?
 
 It was you who came with the argument that you wish to have
 nonbreakable spaces when copying the text from PDF.

No. I said that if you put one in, then you should be
expecting to get one out.
This should be the default behaviour, as it is now.

I certainly suggested nothing like getting out non-breaking
spaces as a replacement for anything else.


 Zdeněk Wagner
 http://hroch486.icpf.cas.cz/wagner/
 http://icebearsoft.euweb.cz



Hope this helps,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-419  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114









Re: [XeTeX] Whitespace in input

2011-11-14 Thread Philip TAYLOR



msk...@ansuz.sooke.bc.ca wrote:

various points with which I have no reason to disagree at this time, followed 
by

 2. Inevitably, people will include invalid characters in TeX input; and
 U+00A0 is an invalid character for TeX input.

Firstly (as is clear from the list on which we are discussing
this), we are not discussing TeX but XeTeX.  Secondly, even
if we were discussing TeX, on what basis do you claim that
U+00A0 is invalid ?  And if you assert that it is, /a priori/,
invalid for TeX, and if your reasons for that assertion are
sound, do they also support the assertion that it is, /a priori/,
invalid for XeTeX ?

Remainder snipped, so that we can debate one point at a time.

Philip Taylor




Re: [XeTeX] Whitespace in input

2011-11-14 Thread Zdenek Wagner
2011/11/14 Philip TAYLOR p.tay...@rhul.ac.uk:


 msk...@ansuz.sooke.bc.ca wrote:

 various points with which I have no reason to disagree at this time,
 followed by

 2. Inevitably, people will include invalid characters in TeX input; and
 U+00A0 is an invalid character for TeX input.

 Firstly (as is clear from the list on which we are discussing
 this), we are not discussing TeX but XeTeX.  Secondly, even
 if we were discussing TeX, on what basis do you claim that
 U+00A0 is invalid ?  And if you assert that it is, /a priori/,
 invalid for TeX, and if your reasons for that assertion are
 sound, do they also support the assertion that it is, /a priori/,
 invalid for XeTeX ?

 Remainder snipped, so that we can debate one point at a time.

I agree with Phil there is nothing in TeX that makes a character
invalid a priori. It is made invalid by \catcode.

There are two aspects:

A. We are preparing a document to be typeset by TeX. Why on earth
should we use only U+00a0 and not ~, which is clearly visible in any
editor and has been used for a nonbreakable space for years? Why do we
use & in \halign or \begin{tabular} and not U+0009?

B. TeX is used to typeset data extracted from a database (or similar
source) that was not TeX-aware in the first place. Such data can
contain not only U+00a0 but even texts such as "Tweedledum & Tweedledee",
"12 $", "15 %", "#1", whatever. In such a case we must be aware that
the input may contain arbitrary characters, even those playing special
roles in TeX. We have to handle them properly.
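
For case B, one way to handle such data is to escape it before it reaches TeX. The following Python sketch is only an illustration under stated assumptions: the `TEX_ESCAPES` table and the `escape_for_tex` function are hypothetical names, the replacement commands are the usual LaTeX ones (plain TeX would need different definitions for some of them), and mapping U+00A0 to `~` is one possible policy, not something prescribed in this thread.

```python
# Hypothetical escaping table for text that was never meant to be TeX source.
# Each character that is special to LaTeX maps to a command that prints it.
TEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
    "\u00a0": "~",  # assumed policy: NO-BREAK SPACE becomes TeX's tie
}


def escape_for_tex(text: str) -> str:
    """Escape one string in a single pass so nothing is escaped twice."""
    return "".join(TEX_ESCAPES.get(ch, ch) for ch in text)


for raw in ["Tweedledum & Tweedledee", "12\u00a0$", "15 %", "#1"]:
    print(escape_for_tex(raw))
```

Working character by character avoids the classic bug of chained replacements, where the backslash introduced by one substitution is escaped again by the next.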

 Philip Taylor






-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] Whitespace in input

2011-11-14 Thread mskala
On Mon, 14 Nov 2011, Philip TAYLOR wrote:
  2. Inevitably, people will include invalid characters in TeX input; and
  U+00A0 is an invalid character for TeX input.

 Firstly (as is clear from the list on which we are discussing
 this), we are not discussing TeX but XeTeX.  Secondly, even

XeTeX is a TeX engine.  Obviously, it is free to define its own input
format, and that format already differs from other TeX engines by (for
instance) allowing some Unicode code points outside the 7-bit range.  But
I still see XeTeX as a version of TeX, not something completely different,
and it's appropriate for expectations we might have about TeX - for
instance, the expectation that formatting commands are visible and the
non-breaking space formatting command is ~ - to also apply to XeTeX
where they are appropriate.

 if we were discussing TeX, on what basis do you claim that
 U+00A0 is invalid ?  And if you assert that it is, /a priori/,

It's invalid if XeTeX says it is invalid, and I think XeTeX should say
it is invalid.

-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/





Re: [XeTeX] Whitespace in input

2011-11-14 Thread Philip TAYLOR



msk...@ansuz.sooke.bc.ca wrote:


XeTeX is a TeX engine.  Obviously, it is free to define its own input
format, and that format already differs from other TeX engines by (for
instance) allowing some Unicode code points outside the 7-bit range.


I think (with respect) that "some Unicode code points outside the 7-bit range"
is a gross understatement.  As far as I am aware, XeTeX permits a very
considerable subset of Unicode (perhaps even all of it; I do not know) as input.


if we were discussing TeX, on what basis do you claim that
U+00A0 is invalid ?  And if you assert that it is, /a priori/,


It's invalid if XeTeX says it is invalid, and I think XeTeX should say
it is invalid.


That is a very different statement, and as that is your
personal position, I respect it as such.  Of course,
I disagree :-)

** Phil.




Re: [XeTeX] Whitespace in input

2011-11-14 Thread mskala
On Mon, 14 Nov 2011, Philip TAYLOR wrote:
 I think (with respect) that "some Unicode code points outside the 7-bit range"
 is a gross understatement.  As far as I am aware, XeTeX permits a very
 considerable subset of Unicode (perhaps even all of it; I do not know) as input.

My point is that it shouldn't treat U+00A0 as equivalent to U+007E, or
as valid at all, just because it supports Unicode.  That is not what
supporting Unicode means.
-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/




Re: [XeTeX] Whitespace in input

2011-11-14 Thread Tobias Schoel



Am 14.11.2011 18:30, schrieb msk...@ansuz.sooke.bc.ca:

1.  No.  That is not what Unicode is for.  Unicode's goal is to subsume
all reasonable pre-existing encodings.

Unicode is even more. Look at all the Annexes to Unicode 6.0

Some reasonable pre-existing
encodings include a non-breaking space character, so Unicode includes one.
That does not mean Unicode says you should actually use it!  There are
many precedents of Unicode providing multiple ways of representing
things, as a result of including characters from other systems, without
it being reasonable to demand that all Unicode-compatible systems must
support all of them.  For instance, most of the U+FFxx range is devoted
to different kinds of hacks for handling partial-width characters in
Asian-language typesetting; the preferred way to do that nowadays is via
OpenType features, but the code points remain in the standard.  The U+0000
to U+001F range is basically control characters for Teletype machines;
some of those, like U+000A and U+000D, are widely used in modern documents
(but in varying ways by different systems!) and others, like U+001D, are
virtually unheard-of.  Unicode does NOT say everybody has to support them
all, let alone all in the same way.

Hmm, I have difficulty understanding the conformance chapter of
Unicode 6.0 ( http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf )
exactly, but it seems to me that claiming Unicode support is a very strong
statement.




The U+00A0 code point is not explicitly deprecated in Unicode, but it was
never a principle of Unicode that all implementations have to support all
defined control characters regardless of appropriateness to the particular
purpose.  Non-breaking space is, from TeX's point of view, not really a
character at all, but a formatting command; and TeX already has a way of
dealing with formatting commands in general and this one in particular.
It is appropriate to say that the preferred way of handling non-breaking
spaces in TeX input is the existing TeX way; and saying that in NO WAY AT
ALL contradicts anything in Unicode.  Unicode is servant, not master.

I think it's more like math being servant _and_ master of natural sciences.


2. Inevitably, people will include invalid characters in TeX input; and
U+00A0 is an invalid character for TeX input.  The best way to deal with
it is to treat it like any other invalid character and generate an error
message.  A reasonable alternative would be to say it is whitespace; it
will be treated like other whitespace.  That would mean ignoring its
breaking/non-breaking-ness, as we have for a long time similarly ignored
the special properties of U+0009 (tab).  Of course, if users want to
define a special meaning for U+00A0 in their own input, they can do so
with the existing mechanisms for redefining the meanings of input
characters; but "U+00A0 is equivalent to U+007E (~)", for instance, should
never be the default and (because of trouble displaying it) shouldn't be
encouraged.

Now we come to the trouble of Unicode specifying a line-breaking
algorithm ( http://www.unicode.org/reports/tr14/tr14-26.html ), which
probably isn't exactly TeX's. I'm not familiar with these algorithms, so I
can't compare them. But I would ask some Master of this Art to speak up
about this conflict.




3. No.  Better to keep everything visible and backward compatible.  U+007E
(~) should remain the preferred way of doing non-breaking space.

Should and is … (see other posts).


4. Not applicable because of the answer to #3.  Users who do insist on
putting U+00A0 in their input presumably have *already* got their own
reasons to think that it's more convenient for them, including solutions
satisfactory to themselves for how to type it on keyboards and see it on
screens, so that's their business and not a problem we need to solve.

I'm personally trying hard to find a correct way. As of now, I have
found a very simple solution for inputting special whitespace characters.
(On Linux, this is easy with ibus.) Alas, I haven't
found any editor better suited to my TeX needs than Kile, but I haven't
yet managed to highlight these special whitespace characters properly.
=> Some experts can do all these things. That doesn't mean everyone
else should stick to stupid old ASCII-7.


bye

Toscho




Re: [XeTeX] Whitespace in input

2011-11-14 Thread Philip TAYLOR



msk...@ansuz.sooke.bc.ca wrote:

On Mon, 14 Nov 2011, Philip TAYLOR wrote:

I think (with respect) that "some Unicode code points outside the 7-bit range"
is a gross understatement.  As far as I am aware, XeTeX permits a very
considerable subset of Unicode (perhaps even all of it; I do not know) as input.


My point is that it shouldn't treat U+00A0 as equivalent to U+007E, or
as valid at all, just because it supports Unicode.  That is not what
supporting Unicode means.


I agree with your opinion that it should not
treat U+00A0 as equivalent to U+007E -- indeed,
the Unicode standard specifies as its compatibility
decomposition :

<noBreak> SPACE (U+0020)
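
The decomposition can be checked against the Unicode Character Database, for instance with Python's standard `unicodedata` module. This is just a quick verification sketch; the variable name `nbsp` is chosen here for illustration. The database records the mapping as a compatibility decomposition tagged `<noBreak>`, so canonical normalization (NFC) leaves U+00A0 alone, while compatibility normalization (NFKC) folds it into an ordinary space.

```python
import unicodedata

nbsp = "\u00a0"

print(unicodedata.name(nbsp))           # NO-BREAK SPACE
print(unicodedata.category(nbsp))       # Zs (space separator)
print(unicodedata.decomposition(nbsp))  # <noBreak> 0020

# Canonical normalization keeps the character as it is ...
print(unicodedata.normalize("NFC", nbsp) == nbsp)   # True
# ... while compatibility normalization replaces it with U+0020.
print(unicodedata.normalize("NFKC", nbsp) == " ")   # True
```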

However, I cannot agree that it should not be
treated as valid; that is just the thin end of
the wedge, and I would sooner there were no
wedge at all.  XeTeX's primary strength is that
it supports Unicode; we should not weaken that
strength by requiring that it supports some parts
of Unicode and not others.

My EUR 0,02.
** Phil.




Re: [XeTeX] Whitespace in input

2011-11-14 Thread Karljurgen Feuerherm
 On Mon, Nov 14, 2011 at 12:15 PM, in message
4ec14cb5.7000...@rhul.ac.uk,
Philip TAYLOR p.tay...@rhul.ac.uk wrote:

 XeTeX is a TeX engine.  Obviously, it is free to define its own input
 format, and that format already differs from other TeX engines by (for
 instance) allowing some Unicode code points outside the 7-bit range.

 I think (with respect) that "some Unicode code points outside the 7-bit
 range" is a gross understatement.  As far as I am aware, XeTeX permits a
 very considerable subset of Unicode (perhaps even all of it; I do not
 know) as input.

I use U+12000 and above regularly, as a case in point...

K




Re: [XeTeX] Whitespace in input

2011-11-14 Thread mskala
On Mon, 14 Nov 2011, Karljurgen Feuerherm wrote:
 I use U+12000 and above regularly, as a case in point...

Do you think that basic formatting control functions should be bound to
code points in that range, as the preferred way of accessing those
functions?  Let's not lose track of what this discussion is about.

XeTeX can *with appropriate font support* accept nearly any Unicode point
in its input.  But very few Unicode points are treated specially by XeTeX
as such, and I don't think U+00A0 should be one of them.
-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/




Re: [XeTeX] Whitespace in input

2011-11-14 Thread Karljurgen Feuerherm
I didn't say anything about U+00A0 one way or the other

Keeping in mind that the purpose of this software is to get work done,
and not to fulfil anyone's philosophical notions of software, my general
feeling is that:

* Xe(La)TeX should support plain text characters--for *my* present
purpose, meaning characters which are printable, pure and simple,
regardless of where in the Unicode space they are; as far as I know,
this is the case now (and my case in point was more or less just aimed
at this issue);

* it should support whatever other characters are necessary to complex
rendering, if it doesn't already;

* optionally it can/could support whatever else, as the in-the-flesh
maintainers of the package have time and leisure to implement.

I said 'feel', because it seems to me all very well for the rest of us
to debate philosophy back and forth, but unless we're doing the actual
work...

As someone has already pointed out, lots of what is in Unicode is there
because it is UNI-code. It may very well have outlived its usefulness,
at least in the context of Xe(La)TeX doing the work one would like it to
do. Just because something is in Unicode doesn't mean one has to want to
use it. In fact, the more unnecessary things one implements, the better
the chance of instability.

There are no doubt multiple ways to achieve this pragmatically stated
goal. I don't feel any vested interest in dictating to anyone the
preference for how to go about it.

K

 On Mon, Nov 14, 2011 at 2:15 PM, in message
alpine.lnx.2.00.141312201.3...@tetsu.ansuz.sooke.bc.ca,
msk...@ansuz.sooke.bc.ca wrote:
 On Mon, 14 Nov 2011, Karljurgen Feuerherm wrote:
 I use U+12000 and above regularly, as a case in point...

 Do you think that basic formatting control functions should be bound to
 code points in that range, as the preferred way of accessing those
 functions?  Let's not lose track of what this discussion is about.

 XeTeX can *with appropriate font support* accept nearly any Unicode point
 in its input.  But very few Unicode points are treated specially by XeTeX
 as such, and I don't think U+00A0 should be one of them.
 --
 Matthew Skala
 msk...@ansuz.sooke.bc.ca People before principles.
 http://ansuz.sooke.bc.ca/




Re: [XeTeX] Whitespace in input

2011-11-14 Thread Philip TAYLOR



msk...@ansuz.sooke.bc.ca wrote:

various points with which I have no reason to disagree at this time,
followed by:


2. Inevitably, people will include invalid characters in TeX input; and
U+00A0 is an invalid character for TeX input.


Firstly (as is clear from the list on which we are discussing
this), we are not discussing TeX but XeTeX.  Secondly, even
if we were discussing TeX, on what basis do you claim that
U+00A0 is invalid ?  And if you assert that it is, /a priori/,
invalid for TeX, and if your reasons for that assertion are
sound, do they also support the assertion that it is, /a priori/,
invalid for XeTeX ?

Remainder snipped, so that we can debate one point at a time.

Philip Taylor

