Re: [XeTeX] Whitespace in input
Hi Zdenek,

I do not think anybody disputes the fact that characters are not glyphs. The confusion arises because a character in CS is well defined and has a history. To be more exact, it is just one byte in size, so that there can be only 256 characters. Unicode has changed all this, and we now have a Unicode character whose size depends on the Unicode encoding used.

It gets even hairier, as in Unicode several characters can be combined (composed). The result to be output is known as a glyph! The average user considers a glyph to be the same as a letter and thereby a character.

Now, in order to process the glyphs with a computer, they must be decomposed back to Unicode. How well this is done depends on the system itself. If the system is not fully Unicode aware and does not implement it properly, then there will be problems. What adds to the complexity of the problem is that not all fonts used for displaying Unicode contain all code points, thereby creating your many-to-many decomposition.

As for getting junk when copying Unicode: just copy between two texts using different fonts, where one font does not contain the glyph.

The only true way to master this problem would be for the computer world to go completely Unicode, with fonts supporting the full Unicode code set! That is impractical for the time being. The only advice I can give is: choose your tools wisely.

regards
Keith.

On 18.11.2011 at 23:51, Zdenek Wagner wrote:

2011/11/18 maxwell maxw...@umiacs.umd.edu:

On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner zdenek.wag...@gmail.com wrote:

2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:

Is it safe to assume that these code listings are restricted to the ASCII character set? If so, yes, spaces are likely to be a problem, but if the code listing can also include ligature-digraphs, then these are likely to prove even more problematic.

If the code listing is typeset in a fixed width font, it is usually no problem.
I copied a few code samples from books in PDF, most of them were typeset by TeX. If I want to copy text in Devanagari, it is almost impossible.

Besides TeX, Dr. Knuth also invented Literate Programming. In our own project, we use LP to extract the code listings from the original source code, rather than from the PDF. One advantage is that in addition to the re-ordering at the character level (mentioned in part of Zdenek's email that I didn't copy over), this allows re-ordering at any arbitrary level,

This is a demonstration that glyphs are not the same as characters. I will start with a simpler case and will not put Devanagari into the mail message. If you wish to write a syllable RU, you have to add a dependent vowel (matra) U to a consonant RA. There is a ligature RU, so in PDF you will not see the RA consonant with the U matra but a RU glyph. Similarly, TRA is a single glyph representing the following characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many mappings, thus it is possible to handle these cases when copying text from a PDF or when searching.

A more difficult case is the I-matra (short dependent vowel I). As a character it must always follow a consonant (this is a general rule for all dependent vowels) but visually (as a glyph) it precedes the consonant group after which it is pronounced. The sample word was kitab (it means a book). In Unicode (as characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually the I-matra precedes KA. XeTeX (knowing that it works with a Devanagari script) runs the character sequence through ICU and the result is the glyph sequence. The original sequence is lost, so that when the text is copied from PDF, we get (not exactly) i*katab.

Microsoft suggested what additional characters should appear in Indic OpenType fonts. One of them is a dotted ring which denotes a missing consonant. The I-matra must always follow a consonant (in character order). If it is moved to the beginning of a word, it is wrong.
If you paste it into a text editor, the OpenType rendering engine should display a missing consonant as a dotted ring (if it is present in the font). In character order the dotted ring will precede the I-matra, but in visual (glyph) order it will be just the opposite. Thus the asterisk shows the place where you will see the dotted circle.

This is just one simple case. The I-matra may follow a consonant group, such as in the word PRIY (dear), which is PA+VIRAMA+RA+I-matra+YA, or STRIYOCIT (good for women), which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both words will start with the I-matra glyph. The latter will contain two ordering bugs after copy-paste.

Consider also the word MURTI (statue), which is a sequence of characters MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
Re: [XeTeX] Whitespace in input
Keith J. Schultz wrote:

I do not think anybody disputes the fact that characters are not glyphs. The confusion arises because a character in CS is well defined and has a history. To be more exact, it is just one byte in size, so that there can be only 256 characters.

Sorry, Keith, this is patently untrue. Replace "is" by "was once" and you get a little closer to the truth, but you still completely ignore issues such as the difference between (say) EBCDIC and ASCII. CDC machines used a 60-bit word, and one character was six bits, not eight. And before the advent of the extended character set, a character consisted of seven bits plus a parity bit, thus yielding at most 128 characters of which 32 were reserved for control functions.

The average user considers a glyph to be the same as a letter and thereby a character.

It is rarely safe to believe that one knows what the average user thinks ...

Now, in order to process the glyphs with a computer, they must be decomposed back to Unicode.

But one rarely, if ever, processes glyphs; the glyphs are the end result, not the input. Glyph processing does become necessary in languages such as Arabic, where context has a major impact on the way in which the individual glyphs are presented, but in Western languages the nearest we get to glyph processing is in the formation of ligature digraphs.

How well this is done depends on the system itself. If the system is not fully Unicode aware and does not implement it properly, then there will be problems. What adds to the complexity of the problem is that not all fonts used for displaying Unicode contain all code points, thereby creating your many-to-many decomposition. As for getting junk when copying Unicode: just copy between two texts using different fonts, where one font does not contain the glyph. The only true way to master this problem would be for the computer world to go completely Unicode, with fonts supporting the full Unicode code set!
I personally hope that this does not happen, and that before then we have an Omnicode consortium to review the mistakes of Unicode and to address them in a future, more orthogonal, more consistent, specification. Philip Taylor -- Subscriptions, Archive, and List information, etc.: http://tug.org/mailman/listinfo/xetex
Re: [XeTeX] Whitespace in input
2011/11/19 Ross Moore ross.mo...@mq.edu.au:

Hi Zdenek,

On 19/11/2011, at 10:30 AM, Zdenek Wagner wrote:

/ActualText is your friend here. You tag the content and provide the string that you want to appear with Copy/Paste as the value associated to a dictionary key.

I do not know whether the PDF specification has evolved since I read it the last time. /ActualText allows only single-byte characters, i.e. those with codes between 0 and 255, not arbitrary Unicode characters.

That is most certainly not true. You code up UTF-16BE as Hex strings. Here is a snippet of the (tagged-pdfLaTeX) source coding from the main example that I showed in my TUG2011 talk. The URL for the video of the talk is given in several of my previous emails:

Thank you for the sample. I will try again when I have more time. Maybe there is a stupid bug in my old code. As a matter of fact, when playing with /ActualText I knew much less than now.

\SMC attr{/ActualText<FEFFD835DC4F>\TPDFaloud{1D44F}} noendtext 254 {mi}%
b%
_{\noEMC%
\TPDFsub
\SMC attr{/ActualText<FEFFD835DC58>\TPDFaloud{1D458}} noendtext 255 {mi}%
k%
\EMC
}^{\EMC
\SMC attr{/ActualText( )} noendtext 256 {Span}%
\pdffakespace
\EMC
}%
\TPDFpopbrack
\SMC attr{/ActualText<FEFF0029>\TPDFaloud{0029}} noendtext 257 {mo}%
\Bigr)%

Inside the resulting PDF, this content looks like:

1 0 0 1 4.902 2.463 cm
/mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt( , b , )>> BDC
BT
/F11 9.9626 Tf
[(b)]TJ
ET
EMC
1 0 0 1 4.276 4.114 cm
/Span <</MCID 11 /ActualText( )>> BDC
BT
/F103 1 Tf
[( )]TJ
ET
EMC
1 0 0 1 0 -6.577 cm
/mi <</MCID 12 /ActualText<FEFFD835DC58>/Alt( sub k , )>> BDC
BT
/F10 6.9738 Tf
[(k)]TJ
ET
EMC
1 0 0 1 4.901 2.463 cm
/mo <</MCID 13 /Alt( close bracket:, , )>> BDC

The full PDF passes all of Adobe's validation tests for correct PDF syntax, Accessible Content, PDF/A-1b compliance.

More particularly:

/mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt( , b , )>> BDC BT /F11 9.9626 Tf [(b)]TJ ET EMC

expresses a math-italic 'b' as:

1. the glyph in the position of letter 'b' (in CMMI10 font);
2. to be spoken aloud as ", b ," where commas indicate a slight pause;
3. to Copy/Paste as the surrogate pair Ux0D835 Ux0DC4F, equivalent to a Plane-1 math-italic character 'b'.

The /MCID key is necessary for tagged PDF, but the /Alt and /ActualText should work independently of full tagging. The '/mi' is immaterial; it could equally well be '/Span'.

/ActualText is demonstrated on German hyphenated words such as Zucker, which is hyphenated as Zuk- ker. I have tried to put /ActualText manually via a special; I could see it in the PDF file but it did not work.

Yes, because it is quite important to position the tagging pieces correctly within the PDF content stream. It has to balance correctly with the BT ... ET and the BDC ... EMC operator pairs, and there may be other subtle requirements. Certainly it cannot be done with just a single \special. There needs to be stuff both before and after the content that causes actual glyphs to be displayed. Just using \pdfliteral is not sufficient with pdfTeX; we needed a special modification that allowed the /mi ... BDC and EMC to fit snugly around the BT ... ET. There could be a similar problem with XeTeX's \special{pdf:literal ... } (or whatever is the syntax).

This is the issue that I was trying to discuss with JK in 2009 or 2010. When converting a white space to a space character, some [complex] heuristics are needed, while proper conversion of glyphs to characters of Indic scripts requires just a few strict rules. Ligatures such as TRA have to appear in the toUnicode map, otherwise their meaning will be unclear. If you see the I-matra, go to the last consonant in the sequence and put the I-matra character there. If you see the RA glyph at the right edge of a syllable, go back to the leftmost consonant in the group and prepend RA+VIRAMA there. This is all that has to be done with Devanagari. Other Indic scripts contain two-part vowels but the rules will be similarly simple.
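The two Devanagari rules just stated can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any PDF reader actually extracts text: it assumes every glyph has already been mapped through the toUnicode table, so the input is a string of code points in visual (glyph) order, and it assumes the RA glyph at the right edge of a syllable arrives as a placeholder. The names `visual_to_logical` and `REPH`, and the choice of the private-use code point U+E000 as that placeholder, are hypothetical.

```python
# Hypothetical sketch: restore Unicode character order from glyph order.
I_MATRA = "\u093F"   # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA
RA = "\u0930"        # DEVANAGARI LETTER RA
REPH = "\uE000"      # assumption: private-use stand-in for the RA glyph
                     # drawn at the right edge of a syllable

def is_consonant(ch: str) -> bool:
    return "\u0915" <= ch <= "\u0939"

def is_matra(ch: str) -> bool:
    return "\u093E" <= ch <= "\u094C"

def visual_to_logical(glyph_order: str) -> str:
    chars = list(glyph_order)

    # Rule 2: replace the right-edge RA glyph by RA+VIRAMA placed in
    # front of the leftmost consonant of the group it sits on.
    i = 0
    while i < len(chars):
        if chars[i] != REPH:
            i += 1
            continue
        j = i - 1
        while j >= 0 and is_matra(chars[j]):
            j -= 1                      # skip vowel signs before the glyph
        if j < 0 or not is_consonant(chars[j]):
            i += 1                      # nothing to attach to; leave as is
            continue
        while j >= 2 and chars[j - 1] == VIRAMA and is_consonant(chars[j - 2]):
            j -= 2                      # walk back to the leftmost consonant
        del chars[i]
        chars[j:j] = [RA, VIRAMA]
        i += 2                          # next unread element shifted by two

    # Rule 1: move each I-matra behind the consonant group that follows it.
    i = 0
    while i < len(chars):
        k = i + 1
        if chars[i] == I_MATRA and k < len(chars) and is_consonant(chars[k]):
            k += 1
            while (k + 1 < len(chars) and chars[k] == VIRAMA
                   and is_consonant(chars[k + 1])):
                k += 2                  # consonant + VIRAMA + consonant ...
            chars.insert(k, I_MATRA)
            del chars[i]
            i = k                       # the moved matra now sits at k - 1
            continue
        i += 1

    return "".join(chars)
```

For the sample word kitab, the visual sequence I-matra+KA+TA+A-matra(long)+BA, written as `"\u093F\u0915\u0924\u093E\u092C"`, comes back as KA+I-matra+TA+A-matra(long)+BA. Input that is already in character order must not be passed through the function, because it cannot tell the two orders apart.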
We should not be forced to double the size of the PDF file. AR and other PDF rendering programs should learn these simple rules and use them when extracting text. If you can provide the UTF-16BE Hex representation of these, I can create a PDF using it as the /ActualText replacement for some arbitrary string of letters. This will test whether this is a viable approach for Devanagari. If so, then it is a matter of working out how to expand this for a full solution. There is a macro package that can do this with pdfTeX, and it is a vital part of my Tagged PDF work for mathematics. Also, I have an example where the CJK.sty package is extended to tag Chinese characters built from multiple glyphs so that Copy/Paste works correctly (modulo PDF reader quirks). Not sure about XeTeX. I once tried to talk with Jonathan Kew
Re: [XeTeX] Whitespace in input
2011/11/19 Keith J. Schultz keithjschu...@web.de:

[snip]

What adds to the complexity of the problem is that not all fonts used for displaying Unicode contain all code points, thereby creating your many-to-many decomposition.

No, conversion of a sequence of glyphs to a sequence of Unicode code points has little to do with fonts. The position of the RU ligature in the font may differ, but it is handled easily by the toUnicode map. The conjunct STA may also occupy a different position in different fonts, but it can always be printed using two glyphs, half-SA + TA. In general, the half forms should be decoded as the full form followed by VIRAMA. This makes the toUnicode table smaller and leads to correct results. The only problem is the correct ordering of a few characters.

As for getting junk when copying Unicode: just copy between two texts using different fonts, where one font does not contain the glyph.

When performing copy-paste or text search in PDF, I am not interested in glyphs but in characters. I do not care what glyphs will be displayed.
If I copy the text to OpenOffice, I can change the font later, and if the code points were transferred correctly, I will see the text (it was true even with OpenOffice 1.x, I tried many years ago). If I copy the text to gedit, fontconfig will automatically find a font for displaying the characters not present in the current font. I still have to read the fontconfig manual in order to find out how to configure its searching algorithm. Arabic fonts may be a problem, especially if you wish to use Arabic, Persian and Urdu. Now I know that I have to force fontconfig to select SIL Scheherazade automatically because it contains all characters. I can thus use both U+0643 and U+06A. When writing Akbar, I can write it both in Arabic and in Urdu/Farsi.

The only true way to master this problem would be for the computer world to go completely Unicode, with fonts supporting the full Unicode code set! That is impractical for the time being.

fontconfig currently has the solution and usually works out of the box.

The only advice I can give is: choose your tools wisely. regards Keith.

[snip]
Re: [XeTeX] Whitespace in input
On Sat, 19 Nov 2011 00:30:58 +0100, Zdenek Wagner wrote:

/ActualText is your friend here. You tag the content and provide the string that you want to appear with Copy/Paste as the value associated to a dictionary key.

I do not know whether the PDF specification has evolved since I read it the last time. /ActualText allows only single-byte characters, i.e. those with codes between 0 and 255, not arbitrary Unicode characters.

This here works fine with pdflatex + xetex:

\documentclass{article}
\usepackage{accsupp}
\begin{document}
\BeginAccSupp{method=hex,unicode,ActualText=20AC}%
Euro%
\EndAccSupp{}%
\BeginAccSupp{method=hex,unicode,ActualText=03B1}%
alpha%
\EndAccSupp{}%
\end{document}

--
Ulrike Fischer
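The hex values used for /ActualText in this thread (20AC, 03B1, FEFFD835DC4F) are simply the UTF-16BE encoding of the replacement text, written as hexadecimal digits. A minimal Python sketch for computing them follows; the function name `actualtext_hex` and its `bom` parameter are made up for illustration and are not part of accsupp or of any PDF tool. The prefix is optional here because the accsupp example passes the digits without FEFF, while the raw PDF strings quoted elsewhere in the thread start with it.

```python
def actualtext_hex(text: str, bom: bool = True) -> str:
    """Return text as upper-case UTF-16BE hex digits.

    With bom=True the byte-order mark FEFF is put in front, as in the
    raw PDF strings quoted in this thread.
    """
    digits = text.encode("utf-16-be").hex().upper()
    return ("FEFF" if bom else "") + digits

# Characters outside the Basic Multilingual Plane become a surrogate pair.
print(actualtext_hex("\U0001D44F"))         # math-italic b
print(actualtext_hex("\u20AC", bom=False))  # Euro sign
```

The first call yields FEFFD835DC4F, the string shown for the math-italic 'b', and the second yields 20AC, the value given to accsupp for the Euro sign.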
Re: [XeTeX] Whitespace in input
2011/11/19 Ulrike Fischer ne...@nililand.de:

[snip]

This here works fine with pdflatex + xetex:

Thank you, the package looks useful.

[snip]

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
Karljürgen G. Feuerherm, PhD Undergraduate Advisor Department of Archaeology and Classical Studies Wilfrid Laurier University 75 University Avenue West Waterloo, Ontario N2L 3C5 Tel. (519) 884-1970 x3193 Fax (519) 883-0991 (ATTN Arch. Classics) On Sat, Nov 19, 2011 at 3:39 AM, in message 4ec76b33.2060...@rhul.ac.uk, Philip TAYLOR p.tay...@rhul.ac.uk wrote: I personally hope that this does not happen, and that before then we have an Omnicode consortium to review the mistakes of Unicode and to address them in a future, more orthogonal, more consistent, specification. Hear, hear! (is that the right spelling?) Wisdom is of course 20/20 hindsight--and the Omnicodists will make their own mistakes... it's inevitable. But still, one should try. K -- Subscriptions, Archive, and List information, etc.: http://tug.org/mailman/listinfo/xetex
Re: [XeTeX] Whitespace in input
OUCH! I have been hit by a veteran truck driver's truck. ;-)) I concede!

I am curious if many still know what an XX-bit word is. Is that term even still used?

True, Unicode needs to be cleaned up; it has become too fragmented.

regards
Keith.

On 19.11.2011 at 09:39, Philip TAYLOR wrote:

[snip]
Re: [XeTeX] Whitespace in input
On 19.11.2011 at 13:51, Zdenek Wagner wrote:

2011/11/19 Keith J. Schultz keithjschu...@web.de:

As for getting junk when copying Unicode: just copy between two texts using different fonts, where one font does not contain the glyph.

When performing copy-paste or text search in PDF, I am not interested in glyphs but in characters. I do not care what glyphs will be displayed. If I copy the text to OpenOffice, I can change the font later, and if the code points were transferred correctly, I will see the

As you say, if transferred correctly!

[snip, snip]

The only advice I can give is: choose your tools wisely.

regards
Keith.
Re: [XeTeX] Whitespace in input
On 2011-11-19 14:25, Keith J. Schultz wrote:

Perhaps this can be of use: https://github.com/wspr/fontspec/issues/121

[snip]
Re: [XeTeX] Whitespace in input
2011/11/19 Pander pan...@users.sourceforge.net:

On 2011-11-19 14:25, Keith J. Schultz wrote:

Perhaps this can be of use: https://github.com/wspr/fontspec/issues/121

As Khaled wrote, it belongs to the engine. ZWJ and ZWNJ are used in Indic scripts, and they have worked fine since I started to use XeTeX in 2008.

[snip]

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
On Sat, Nov 19, 2011 at 5:19 AM, Keith J. Schultz keithjschu...@web.de wrote: OUCH! I have been hit by a veteran truck drivers truck. ;-)) I concede! I am curious if many still know what a XX-bit word is. Is that term even still used? It will fade out of use until someone decides we need 128-bit words and then will pop in again ;-) Best Wishes, Chris Travers -- Subscriptions, Archive, and List information, etc.: http://tug.org/mailman/listinfo/xetex
Re: [XeTeX] Whitespace in input
Hi Philip,

Throughout my programming life and experience I have learned that internal structure means nothing, as long as the result is correct when it comes out.

As you rightfully point out, the problem lies in how TeX internally handles space characters when adding them to its internal structure. The fact is that initially, TeX was not designed to handle modern typesetting well. (Xe)TeX's internals are partially quite outdated. It is possible to handle all these new types of spaces in (Xe)TeX, yet it is quite awkward and you have to be a TeXnician to do it properly.

My personal opinion is that TeX et al. has to be revamped completely. Ideally, it should get a natural language parser as a front end and the typesetting module as its back end for its output.

Yes, I know this would not be TeX any more and would require a completely different structure of the TeX eco-system. Language modules and the like.

If you care to discuss this, we can back channel, as it would be too OT here.

regards
Keith.

On 17.11.2011 at 20:56, Philip TAYLOR wrote:

Ross, I do not dispute your arguments: I was answering Keith's question in an honest way. I (personally) do not think of a space in TeX output as a character at all, because I am steeped in TeX philosophy; but I am quite willing to accept that /if/ the objective is not to produce output for the sake of output, but output for subsequent processing as input by another program, then there /may/ be an argument for outputting a space as a variable-width glyph. However, I do think that what appears in the output stream is a secondary consideration; far more important (IMHO) is how we represent that space /within XeTeX/.
There is, I am sure, not a suggestion on the table that we start to treat a conventional space in XeTeX other than as TeX has traditionally treated it, and therefore the real question is (to my mind), do we adopt an extension of this traditional TeX treatment for non-breaking space, thin-space, and any of the other not-quite-standard spaces that Unicode encompasses, or do we look for an alternative model which /might/ be glyph- or character-based ?. -- Subscriptions, Archive, and List information, etc.: http://tug.org/mailman/listinfo/xetex
Re: [XeTeX] Whitespace in input
2011/11/18 Keith J. Schultz keithjschu...@web.de:

[snip]

My personal opinion is that TeX et al. has to be revamped completely. Ideally, it should get a natural language parser as a front end and the typesetting module as its back end for its output.

I admit that things could be done better than in today's TeX, but its complete revamping seems to me a bad investment. I would rather think of an FO processor.

[snip]

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
Zdenek Wagner wrote:

I admit that things could be done better than in today's TeX, but its complete revamping seems to me a bad investment. I would rather think of an FO processor.

And I agree with Zdeněk: this discussion will be productive only if we focus on what can be accomplished (w.r.t. spaces) with few or no changes to XeTeX, not on how we might best deal with the whole (intellectually daunting) issue of optimally typesetting Unicode.

** Phil.
Re: [XeTeX] Whitespace in input
On Fri, 18 Nov 2011 08:31:28 +1100, Ross Moore wrote:

Yes, that's the point. The goal of TeX is nice typographical appearance. The goal of XML is easy data exchange. If I want to send structured data, I send XML, not PDF.

These days people want both. One question which pops up regularly in the TeX groups is how to insert a code listing into a PDF so that it can be copied and pasted reliably. Currently this is not easy: the heuristics of the readers can easily lose spaces, and you can't encode tabs or a specific number of spaces. Real space characters in the PDF (instead of only visible space) would help a lot here.

-- Ulrike Fischer
Re: [XeTeX] Whitespace in input
Is it safe to assume that these code listings are restricted to the ASCII character set? If so, yes, spaces are likely to be a problem, but if the code listing can also include ligature-digraphs, then these are likely to prove even more problematic.

** Phil.

Ulrike Fischer wrote:

One question which pops up regularly in the TeX-groups is how can I insert a code listing in my pdf so that it can be copied and pasted reliably. Currently this is not easy as the heuristics of the readers can easily lose spaces, you can't encode tabs or a specific number of spaces. Real space characters in the pdf (instead of only visible space) would help here a lot.
Re: [XeTeX] Whitespace in input
2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:

Is it safe to assume that these code listings are restricted to the ASCII character set? If so, yes, spaces are likely to be a problem, but if the code listing can also include ligature-digraphs, then these are likely to prove even more problematic.

If the code listing is typeset in a fixed-width font, it is usually no problem. I copied a few code samples from books in PDF; most of them were typeset by TeX. If I want to copy text in Devanagari, it is almost impossible. If I take just a simple Hindi word, किताब, the best result I can get will be िकताब (you should see a dotted circle which is not visible in the PDF). The reason is that the first two letters are U+0915, U+093F, but visually the latter is displayed first. After copying you get the reversed order U+093F, U+0915. This is just one of many problems with Devanagari. The toUnicode map does not help much with Indian scripts. I have never tried to copy Arabic from PDF, or even the combination of LTR and RTL within a paragraph.

** Phil.

Ulrike Fischer wrote:

One question which pops up regularly in the TeX-groups is how can I insert a code listing in my pdf so that it can be copied and pasted reliably. Currently this is not easy as the heuristics of the readers can easily lose spaces, you can't encode tabs or a specific number of spaces. Real space characters in the pdf (instead of only visible space) would help here a lot.

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
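The swap described in this message can be made visible by listing the code points of both strings. The following is only an illustrative Python sketch, not something from the thread; the helper name `describe` is made up for the example, and the "copied" string is simply the sequence quoted above.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per character of text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Logical (character) order of the word "kitab" as stored in Unicode.
logical = "\u0915\u093F\u0924\u093E\u092C"
# Order obtained after copying from the PDF, as reported in the message.
copied = "\u093F\u0915\u0924\u093E\u092C"

for label, word in (("logical", logical), ("copied", copied)):
    print(label)
    for entry in describe(word):
        print("  ", entry)
```

Both lists contain the same five characters; only the first two (DEVANAGARI LETTER KA and DEVANAGARI VOWEL SIGN I) change places.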
Re: [XeTeX] Whitespace in input
On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner zdenek.wag...@gmail.com wrote: 2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk: Is it safe to assume that these code listings are restricted to the ASCII character set ? If so, yes, spaces are likely to be a problem, but if the code listing can also include ligature- digraphs, then these are likely to prove even more problematic. If the code listing is typeset in a fixed width font, it is usually no problem. I copied a few code samples from books in PDF, most of them were typeset by TeX. If I want to copy text in Devanagari, it is almost impossible. Besides TeX, Dr. Knuth also invented Literate Programming. In our own project, we use LP to extract the code listings from the original source code, rather than from the PDF. One advantage is that in addition to the re-ordering at the character level (mentioned in part of Zdenek's email that I didn't copy over), this allows re-ordering at any arbitrary level, even entire sections of program code. (We happen to be using XML to contain the source of both our text and our programming language constructs, but that's a different issue.) I agree that it would be nice to be able to reliably copy Unicode text from the PDF, but (a) that issue isn't confined to program listings, and (b) that would only solve the character ordering part of the problem. Mike Maxwell -- Subscriptions, Archive, and List information, etc.: http://tug.org/mailman/listinfo/xetex
Re: [XeTeX] Whitespace in input
2011/11/18 maxwell maxw...@umiacs.umd.edu:

On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner zdenek.wag...@gmail.com wrote:

2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:

Is it safe to assume that these code listings are restricted to the ASCII character set? If so, yes, spaces are likely to be a problem, but if the code listing can also include ligature-digraphs, then these are likely to prove even more problematic.

If the code listing is typeset in a fixed width font, it is usually no problem. I copied a few code samples from books in PDF, most of them were typeset by TeX. If I want to copy text in Devanagari, it is almost impossible.

Besides TeX, Dr. Knuth also invented Literate Programming. In our own project, we use LP to extract the code listings from the original source code, rather than from the PDF. One advantage is that in addition to the re-ordering at the character level (mentioned in part of Zdenek's email that I didn't copy over), this allows re-ordering at any arbitrary level,

This is a demonstration that glyphs are not the same as characters. I will start with a simpler case and will not put Devanagari into the mail message. If you wish to write the syllable RU, you have to add a dependent vowel (matra) U to the consonant RA. There is a ligature RU, so in the PDF you will not see the RA consonant with a U matra but a RU glyph. Similarly, TRA is a single glyph representing the following characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many mappings, thus it is possible to handle these cases when copying text from a PDF or when searching.

A more difficult case is the I-matra (short dependent vowel I). As a character it must always follow a consonant (this is a general rule for all dependent vowels), but visually (as a glyph) it precedes the consonant group after which it is pronounced. The sample word was kitab (it means a book). In Unicode (as characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually the I-matra precedes KA. XeTeX (knowing that it works with a Devanagari script) runs the character sequence through ICU and the result is the glyph sequence. The original sequence is lost, so that when the text is copied from the PDF, we get (not exactly) i*katab.

Microsoft suggested what additional characters should appear in Indic OpenType fonts. One of them is a dotted ring which denotes a missing consonant. The I-matra must always follow a consonant (in character order). If it is moved to the beginning of a word, it is wrong. If you paste it into a text editor, the OpenType rendering engine should display the missing consonant as a dotted ring (if it is present in the font). In character order the dotted ring will precede the I-matra, but in visual (glyph) order it will be just the opposite. Thus the asterisk shows the place where you will see the dotted circle.

This is just one simple case. The I-matra may follow a consonant group, such as in the word PRIY (dear), which is PA+VIRAMA+RA+I-matra+YA, or STRIYOCIT (good for women), which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both words will start with the I-matra glyph. The latter will contain two ordering bugs after copy-paste.

Consider also the word MURTI (statue), which is the character sequence MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will appear as an accent below the MA glyph. The next glyph will be the I-matra, followed by TA, followed by RA shown as an upper accent at the right edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA glyph appears at the end of the syllable although logically (in character order) it belongs to the beginning.

These cases cannot be solved by the toUnicode map because many-to-many mappings are not allowed. Moreover, a huge number of mappings would be needed. It would be better to do the reverse processing independently of toUnicode mappings, using ICU or Pango or Uniscribe or whatever to analyze the glyphs and convert them to characters. The rules are unambiguous but AR does not do it.

We discuss nonbreakable spaces while we are not yet able to properly convert printable glyphs to characters when doing copy-paste from PDF...

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
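As a rough illustration of the "reverse processing" suggested in this message, here is a minimal Python sketch that handles only the I-matra case. It is not from the thread and rests on stated assumptions: the input is text already extracted in visual order, ligature glyphs have already been mapped back to their character sequences (for example via the toUnicode map), and only the basic consonant range U+0915 to U+0939 is considered. The function name `visual_to_logical` is hypothetical, and the RA-at-the-end-of-the-syllable case is not covered.

```python
I_MATRA = "\u093F"   # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA

def is_consonant(ch):
    """True for the basic Devanagari consonants KA..HA."""
    return "\u0915" <= ch <= "\u0939"

def visual_to_logical(text):
    """Move each I-matra found before a consonant group to after that group.

    Intended only for text in visual (glyph) order; applying it to text
    that is already in logical order would corrupt it.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == I_MATRA and i + 1 < n and is_consonant(text[i + 1]):
            # Consume the group: consonant (VIRAMA consonant)*
            j = i + 2
            while j + 1 < n and text[j] == VIRAMA and is_consonant(text[j + 1]):
                j += 2
            out.append(text[i + 1:j])  # the consonant group first
            out.append(I_MATRA)        # then the dependent vowel
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# kitab as copied from the PDF: I-matra, KA, TA, AA-matra, BA
print(visual_to_logical("\u093F\u0915\u0924\u093E\u092C") ==
      "\u0915\u093F\u0924\u093E\u092C")  # prints True
```

The same function restores PRIY and STRIYOCIT from their visual order under the assumptions above; a real extractor would of course work on glyphs and font data rather than on a plain string.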
Re: [XeTeX] Whitespace in input
Hi Zdenek, On 19/11/2011, at 9:51 AM, Zdenek Wagner wrote: This is a demonstration that glyphs are not the same as characters. I will startt with a simpler case and will not put Devanagari to the mail message. If you wish to write a syllable RU, you have to add a dependent vowel (matra) U to a consonant RA. There is a ligature RU, so in PDF you will not see RA consonant with U matra but a RU glyph. Similarly, TRA is a single glyph representing the following characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many mappings thus it is possible to handle these cases when copying text from a PDF or when searching. More difficult case is I matra (short dependent vowel I). As a character it must always follow a consonant (this is a general rule for all dependent vowels) but visually (as a glyph) it precedes the consonant group after which it is pronounced. The sample word was kitab (it means a book). In Unicode (as characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually I-matra precedes KA. XeTeX (knowing that it works with a Devanagari script) runs the character sequence through ICU and the result is the glyph sequence. The original sequence is lost so that when the text is copied from PDF, we get (not exactly) i*katab. /ActualText is your friend here. You tag the content and provide the string that you want to appear with Copy/Paste as the value associated to a dictionary key. There is a macro package that can do this with pdfTeX, and it is a vital part of my Tagged PDF work for mathematics. Also, I have an example where the CJK.sty package is extended to tag Chinese characters built from multiple glyphs so that Copy/Paste works correctly (modulo PDF reader quirks). Not sure about XeTeX. I once tried to talk with Jonathan Kew about what would be needed to implement this properly, but he got totally the wrong idea concerning glyphs and characters, and what was needed to be done internally and what by macros. The conversation went nowhere. 
Microsoft suggested what additional characters should appear in Indic OpenType fonts. One of them is a dotted ring which denotes a missing consonant. I-matra must always follow a consonant (in character order). If it is moved to the beginning of a word, it is wrong. If you paste it to a text editor, the OpenType rendering engine should display a missing consonant as a dotted ring (if it is present in the font). In character order the dotted ring will precede I-matra but in visual (glyph) order it will be just opposite. Thus the asterisk shows the place where you will see the dotted circle. This is just one simple case. I-matra may follow a consonant group, such as in word PRIY (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women) which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both words will start with the I-matra glyph. The latter will contain two ordering bugs after copypaste. Consider also word MURTI (statue) which is a sequence of characters This sounds like each word needs its own /ActualText . So some intricate programming is certainly necessary. But \XeTeXinterchartoks (is that the right spelling?) should make this possible. MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will appear as an accent below the MA glyph. The next glyph will be I-matra followed by TA followed by RA shown as an upper accent at the right edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA glyph appears at the end of the syllable although locically (in character order) it belongs to the beginning. These cases cannot be solved by toUnicode map because many-to-many mappings are not allowed. Agreed. /ToUnicode is not the right PDF construction for this. Moreover, a huge amount of mappings will be needed. It would be better to do the reverse processing independent of toUnicode mappings, to use ICU or Pango or Uniscribe or whatever to analyze the glyphs and convert them to characters. 
The rules are unambiguous but AR does not do it.

Having an external pre-processor is what I do for tagging mathematics. It seems like a similarly intricate problem here.

We discuss nonbreakable spaces while we are not yet able to convert properly printable glyphs to characters when doing copypaste from PDF...

:-)

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz

Hope this helps,

Ross

Ross Moore                 ross.mo...@mq.edu.au
Mathematics Department     office: E7A-419
Macquarie University       tel: +61 (0)2 9850 8955
Sydney, Australia 2109     fax: +61 (0)2 9850 8114
Re: [XeTeX] Whitespace in input
2011/11/19 Ross Moore ross.mo...@mq.edu.au: Hi Zdenek, On 19/11/2011, at 9:51 AM, Zdenek Wagner wrote: This is a demonstration that glyphs are not the same as characters. I will startt with a simpler case and will not put Devanagari to the mail message. If you wish to write a syllable RU, you have to add a dependent vowel (matra) U to a consonant RA. There is a ligature RU, so in PDF you will not see RA consonant with U matra but a RU glyph. Similarly, TRA is a single glyph representing the following characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many mappings thus it is possible to handle these cases when copying text from a PDF or when searching. More difficult case is I matra (short dependent vowel I). As a character it must always follow a consonant (this is a general rule for all dependent vowels) but visually (as a glyph) it precedes the consonant group after which it is pronounced. The sample word was kitab (it means a book). In Unicode (as characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually I-matra precedes KA. XeTeX (knowing that it works with a Devanagari script) runs the character sequence through ICU and the result is the glyph sequence. The original sequence is lost so that when the text is copied from PDF, we get (not exactly) i*katab. /ActualText is your friend here. You tag the content and provide the string that you want to appear with Copy/Paste as the value associated to a dictionary key. I do not know whether the PDF specification has evolved since I read it the last time. /ActualText allows only single-byte characters, ie those with codes between 0 and 255, not arbitrary Unicode characters. /ActualText is demonstrated on German hyphenated words such as Zucker which is hyphenated as Zuk- ker. I have tried to put /ActualText manually via a special, I could see it in the PDF file but it did not work. 
When converting a white space to a space character some [complex] heuristics is needed while proper conversion of glyphs to characters of Indic scripts require just a few strict rules. The ligatures as TRA have to appear in the toUnicode map, otherwise its meaning will be unclear. If you see the I-matra, go to the last consonant in the sequence and put the I-matra character there. If you see the RA glyph at the right edge of a syllable, go back to the leftmost consonant in the group and prepend RA+VIRAMA there. This is all what has to be done with Devanagari. Other Indic scripts contain two-part vowels but the rules will be similarly simple. We should not be forced to double the size of the PDF file. AR and other PDF rendering programs should learn these simple rules and use them when extracting text. There is a macro package that can do this with pdfTeX, and it is a vital part of my Tagged PDF work for mathematics. Also, I have an example where the CJK.sty package is extended to tag Chinese characters built from multiple glyphs so that Copy/Paste works correctly (modulo PDF reader quirks). Not sure about XeTeX. I once tried to talk with Jonathan Kew about what would be needed to implement this properly, but he got totally the wrong idea concerning glyphs and characters, and what was needed to be done internally and what by macros. The conversation went nowhere. Microsoft suggested what additional characters should appear in Indic OpenType fonts. One of them is a dotted ring which denotes a missing consonant. I-matra must always follow a consonant (in character order). If it is moved to the beginning of a word, it is wrong. If you paste it to a text editor, the OpenType rendering engine should display a missing consonant as a dotted ring (if it is present in the font). In character order the dotted ring will precede I-matra but in visual (glyph) order it will be just opposite. Thus the asterisk shows the place where you will see the dotted circle. 
This is just one simple case. I-matra may follow a consonant group, such as in word PRIY (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women) which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both words will start with the I-matra glyph. The latter will contain two ordering bugs after copypaste. Consider also word MURTI (statue) which is a sequence of characters This sounds like each word needs its own /ActualText . So some intricate programming is certainly necessary. But \XeTeXinterchartoks (is that the right spelling?) should make this possible. MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will appear as an accent below the MA glyph. The next glyph will be I-matra followed by TA followed by RA shown as an upper accent at the right edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA glyph appears at the end of the syllable although locically (in character order) it belongs to the beginning. These cases cannot be solved by toUnicode map because many-to-many mappings are not
Re: [XeTeX] Whitespace in input
Hi Zdenek,

On 19/11/2011, at 10:30 AM, Zdenek Wagner wrote:

/ActualText is your friend here. You tag the content and provide the string that you want to appear with Copy/Paste as the value associated to a dictionary key.

I do not know whether the PDF specification has evolved since I read it the last time. /ActualText allows only single-byte characters, ie those with codes between 0 and 255, not arbitrary Unicode characters.

That is most certainly not true. You code up UTF-16BE as Hex strings.

Here is a snippet of the (tagged-pdfLaTeX) source coding from the main example that I showed in my TUG2011 talk. The URL for the video of the talk is given in several of my previous emails:

    \SMC attr{/ActualText<FEFFD835DC4F>\TPDFaloud{1D44F}} noendtext 254 {mi}%
    b%
    _{\noEMC%
    \TPDFsub
    \SMC attr{/ActualText<FEFFD835DC58>\TPDFaloud{1D458}} noendtext 255 {mi}%
    k%
    \EMC
    }^{\EMC
    \SMC attr{/ActualText( )} noendtext 256 {Span}%
    \pdffakespace
    \EMC
    }%
    \TPDFpopbrack
    \SMC attr{/ActualText<FEFF0029>\TPDFaloud{0029}} noendtext 257 {mo}%
    \Bigr)%

Inside the resulting PDF, this content looks like:

    1 0 0 1 4.902 2.463 cm
    /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt( , b , )>> BDC
    BT
    /F11 9.9626 Tf
    [(b)]TJ
    ET
    EMC
    1 0 0 1 4.276 4.114 cm
    /Span <</MCID 11 /ActualText( )>> BDC
    BT
    /F103 1 Tf
    [( )]TJ
    ET
    EMC
    1 0 0 1 0 -6.577 cm
    /mi <</MCID 12 /ActualText<FEFFD835DC58>/Alt( sub k , )>> BDC
    BT
    /F10 6.9738 Tf
    [(k)]TJ
    ET
    EMC
    1 0 0 1 4.901 2.463 cm
    /mo <</MCID 13 /Alt( close bracket:, , )>> BDC

The full PDF passes all of Adobe's validation tests for correct PDF syntax, Accessible Content, PDF/A-1b compliance. More particularly:

    /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt( , b , )>> BDC
    BT
    /F11 9.9626 Tf
    [(b)]TJ
    ET
    EMC

expresses a math-italic 'b' as:

 1. the glyph in the position of letter 'b' (in CMMI10 font);
 2. to be spoken aloud as ' , b , ' where commas indicate a slight pause;
 3. to Copy/Paste as the surrogate pair Ux0D835 Ux0DC4F, equivalent to a Plane-1 math-italic character 'b'.
The /MCID key is necessary for tagged PDF, but the /Alt and /ActualText should work independently of full tagging. The '/mi' is immaterial; it could equally well be '/Span'.

/ActualText is demonstrated on German hyphenated words such as Zucker which is hyphenated as Zuk-ker. I have tried to put /ActualText manually via a special, I could see it in the PDF file but it did not work.

Yes, because it is quite important to position the tagging pieces correctly within the PDF content stream. It has to balance correctly with the BT ... ET and the BDC ... EMC operator pairs, and there may be other subtle requirements. Certainly it cannot be done with just a single \special. There needs to be stuff both before and after the content that causes actual glyphs to be displayed. Just using \pdfliteral is not sufficient with pdfTeX; we needed a special modification that allowed the /mi ...BDC and EMC to fit snugly around the BT ... ET. There could be a similar problem with XeTeX's \special{pdf:literal ... } (or whatever is the syntax). This is the issue that I was trying to discuss with JK in 2009 or 2010.

When converting white space to a space character, some [complex] heuristics are needed, while proper conversion of glyphs to characters of Indic scripts requires just a few strict rules. Ligatures such as TRA have to appear in the toUnicode map, otherwise their meaning will be unclear. If you see the I-matra, go to the last consonant in the sequence and put the I-matra character there. If you see the RA glyph at the right edge of a syllable, go back to the leftmost consonant in the group and prepend RA+VIRAMA there. This is all that has to be done with Devanagari. Other Indic scripts contain two-part vowels but the rules will be similarly simple. We should not be forced to double the size of the PDF file. AR and other PDF rendering programs should learn these simple rules and use them when extracting text.

If you can provide the UTF-16BE Hex representation of these, I can create a PDF using it as the /ActualText replacement for some arbitrary string of letters. This will test whether this is a viable approach for Devanagari. If so, then it is a matter of working out how to expand this for a full solution.

There is a macro package that can do this with pdfTeX, and it is a vital part of my Tagged PDF work for mathematics. Also, I have an example where the CJK.sty package is extended to tag Chinese characters built from multiple glyphs so that Copy/Paste works correctly (modulo PDF reader quirks). Not sure about XeTeX. I once tried to talk with Jonathan Kew about what would be needed to implement this properly, but he got totally the wrong idea concerning glyphs and characters, and what was needed to be done internally and what by macros. The conversation went nowhere.

-- Zdeněk Wagner

Cheers,
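The UTF-16BE hex representation asked for here can be produced mechanically. The following is a small Python sketch, not part of the original exchange; the function name `actualtext_hex` is made up for illustration. It prefixes the byte-order mark FEFF, as in the /ActualText values quoted in this message; in the PDF itself the result would be written as a hex string between angle brackets.

```python
def actualtext_hex(text):
    """Return text as an uppercase UTF-16BE hex string with a leading FEFF BOM."""
    # "utf-16-be" does not emit a BOM on its own, so it is added explicitly.
    return "FEFF" + text.encode("utf-16-be").hex().upper()

# Plane-1 math-italic 'b' (U+1D44F) becomes a surrogate pair.
print(actualtext_hex("\U0001D44F"))                      # FEFFD835DC4F
# The word kitab in logical order: KA, I-matra, TA, AA-matra, BA.
print(actualtext_hex("\u0915\u093F\u0924\u093E\u092C"))  # FEFF0915093F0924093E092C
```

The first value matches the one used for the math-italic 'b' in the example from the talk, which gives some confidence that the same encoding step would apply to Devanagari strings.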
Re: [XeTeX] Whitespace in input
O.K. You mention in a later post that you do consider a space as a printable character. I do disagree, in the sense that, even though you cannot actually see how many spaces are in a run, a space does have a size and thereby a fixed visual effect. I do agree with you that a space character should, in good layout, be changed to white space to accommodate good line breaking. So it is not truly a printable character in text layout. Though I do consider inter-character spacing a preferable method of achieving a more aesthetic look.

Now, more to the point. Often enough there are conventions that one has to follow concerning the wrapping of words, most prominently for names. As an example I will use my name, Keith J. Schultz. (Yes, this is not the best example, and (Xe)TeX has ways of getting around this.) Names should not be wrapped, nor should there be unnecessary space between the parts. Generally, it is O.K. to break the line after the J., but not between Keith and J., so I need a non-breaking space between them. Also, you do not want different spaces between Keith, J. and Schultz, yet not the same space as used between the other words of the line. If the J. bothers you, use Johan instead. The same is true of Mrs. Smith.

So the use of a non-breaking space of a given size is advisable for input. Of course, what TeX et al. should output is debatable, and it wreaks havoc with TeX's line-breaking algorithm. It is often hard to get the desired results. But the way TeX works, this will always be a problem. Yet when I enter a non-breaking space, that is what I want, and more often than not a space of fixed size across the board.

regards Keith.

On 15.11.2011 at 12:09, Philip and Le Khanh wrote:

Keith J. Schultz wrote:

A non-breaking space is to me a printable character, in so far as it is important and must be used to distinguish between word space, et al.

If, for you, [a] non-breaking space is a printable character, then presumably that character must be taken from some font. If you take a character from a font, it will have a size, and although it can be combined with kerning rules to adjust its position w.r.t. adjacent characters, the logic for this is fairly restricted. In particular, it cannot take into account the amount by which TeX is seeking to expand or contract spaces on the current line in order to achieve optimal paragraphs. So in your model of the ideal universe, non-breaking Unicode spaces would not behave as do conventional TeX non-breaking spaces (which /do/ expand and contract to assist in TeX's line-breaking), nor would they conform to their Unicode definition where their decomposition is defined as: <noBreak> SPACE (U+0020)

I wonder if you would like to discuss these points?

Philip Taylor
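The Unicode decomposition mentioned here can be checked directly against the Unicode Character Database. This is a small illustrative Python sketch, not part of the thread; the helper name `space_decomposition` is made up for the example.

```python
import unicodedata

def space_decomposition(ch):
    """Return the compatibility decomposition field for ch, e.g. '<noBreak> 0020'."""
    return unicodedata.decomposition(ch)

nbsp = "\u00A0"  # NO-BREAK SPACE
print(unicodedata.name(nbsp), "->", space_decomposition(nbsp))
# Compatibility normalization folds the no-break space into a plain space.
print(unicodedata.normalize("NFKC", nbsp) == " ")  # prints True
```

In other words, as far as its character identity goes, U+00A0 is an ordinary space carrying a no-break property; nothing in the decomposition fixes its width.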
Re: [XeTeX] Whitespace in input
On 17.11.2011 at 11:26, Keith J. Schultz wrote:

O.K. You mention in a later post that you do consider a space as a printable character.

This line should read as: You mention in a later post that you consider a space as a non-printable character.
Re: [XeTeX] Whitespace in input
Keith J. Schultz wrote:

On 17.11.2011 at 11:26, Keith J. Schultz wrote:

O.K. You mention in a later post that you do consider a space as a printable character.

This line should read as: You mention in a later post that you consider a space as a non-printable character.

No, I don't think of it as a character at all, when we are talking about typeset output (as opposed to ASCII (or Unicode) input). Clearly it is a character on input, but unless it generates a glyph in the output stream (which TeX does not, for normal spaces) then it is not a character (/qua/ character) on output but rather a formatting instruction not dissimilar to (say) end-of-line.

** Phil.
Re: [XeTeX] Whitespace in input
Hi Phil,

On 17/11/2011, at 23:53, Philip TAYLOR p.tay...@rhul.ac.uk wrote:

> Keith J. Schultz wrote:
>> You mention in a later post that you do consider a space as a printable character.
>> This line should read as: You mention in a later post that you consider a space as a non-printable character.
>
> No, I don't think of it as a character at all, when we are talking about typeset output (as opposed to ASCII (or Unicode) input).

This is fine when all that you require of your output is that it be visible on a printed page. But modern communication media go much beyond that. A machine needs to be able to tell where words and lines end, to reflow paragraphs when appropriate, and to produce a flat extraction of all the text, perhaps also with some indication of the purpose of that text (e.g. by structural tagging).

In short, what is output for one format should also be able to serve as input for another. Thus the space certainly does play the role of an output character, though the presence of a gap in the positioning of visible letters may serve this role in many, but not all, circumstances.

> Clearly it is a character on input, but unless it generates a glyph in the output stream (which TeX does not, for normal spaces) then it is not a character (/qua/ character) on output but rather a formatting instruction not dissimilar to (say) end-of-line.

But a formatting instruction for one program cannot serve as reliable input for another. A heuristic is then needed, to attempt to infer that a programming instruction must have been used, and to guess what kind of instruction it might have been. This is not 100% reliable, so it is deprecated in modern methods of data storage and document formats. XML-based formats use tagging rather than programming instructions. This is the modern way, which is used extensively for communicating data between different software systems.

> ** Phil.

TeX's strength is in its superior ability to position characters on the page for maximum visual effect. This is done by producing detailed programming instructions within the content stream of the PDF output. However, this is not enough to meet the needs of formats such as EPUB, non-visual reading software, archival formats, searchability, and other needs.

Tagged PDF can be viewed as Adobe's response to these requirements, as an extension of the visual aspects of the PDF format. It is a direction in which TeX can (and surely must) move, to stay relevant within the publishing industry of the future.

Hope this helps,
Ross

--
Subscriptions, Archive, and List information, etc.: http://tug.org/mailman/listinfo/xetex
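As an illustration of the heuristic Ross describes, here is a minimal Python sketch of how an extraction tool might guess word boundaries when the page carries only positioned glyphs and no space characters. The `Glyph` record, the function name `extract_text`, the 10pt font size and the 0.25 threshold are all assumptions made up for this example; no real PDF library or extractor is being quoted.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # text the glyph maps to (hypothetical record, not a real API)
    x: float      # left edge on the baseline, in points
    width: float  # advance width, in points


def extract_text(glyphs, font_size=10.0, gap_ratio=0.25):
    """Rebuild one line of text, guessing a space wherever the gap is wide."""
    pieces = []
    prev_end = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_end is not None and g.x - prev_end > gap_ratio * font_size:
            pieces.append(" ")  # a guess: no space character exists in the data
        pieces.append(g.char)
        prev_end = g.x + g.width
    return "".join(pieces)


# "TeX is" set with an inter-word gap of 3.5pt ...
loose = [Glyph("T", 0, 6), Glyph("e", 6, 4.5), Glyph("X", 10.5, 7),
         Glyph("i", 21, 3), Glyph("s", 24, 4)]
# ... and the same words on a line where the glue shrank to 2pt.
tight = [Glyph("T", 0, 6), Glyph("e", 6, 4.5), Glyph("X", 10.5, 7),
         Glyph("i", 19.5, 3), Glyph("s", 22.5, 4)]

print(extract_text(loose))  # TeX is
print(extract_text(tight))  # TeXis
```

The second line loses its word break because the threshold is only a guess about what the typesetter did, which is the sense in which such inference is "not 100% reliable".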
Re: [XeTeX] Whitespace in input
Ross,

I do not dispute your arguments: I was answering Keith's question in an honest way. I (personally) do not think of a space in TeX output as a character at all, because I am steeped in TeX philosophy; but I am quite willing to accept that /if/ the objective is not to produce output for the sake of output, but output for subsequent processing as input by another program, then there /may/ be an argument for outputting a space as a variable-width glyph.

However, I do think that what appears in the output stream is a secondary consideration; far more important (IMHO) is how we represent that space /within XeTeX/. There is, I am sure, no suggestion on the table that we start to treat a conventional space in XeTeX other than as TeX has traditionally treated it. The real question (to my mind) is therefore: do we adopt an extension of this traditional TeX treatment for non-breaking space, thin space, and any of the other not-quite-standard spaces that Unicode encompasses, or do we look for an alternative model which /might/ be glyph- or character-based?

** Phil.
Re: [XeTeX] Whitespace in input
2011/11/17 Ross Moore ross.mo...@mq.edu.au:

> Hi Phil,
>
> On 17/11/2011, at 23:53, Philip TAYLOR p.tay...@rhul.ac.uk wrote:
>
>> Keith J. Schultz wrote:
>>> You mention in a later post that you do consider a space as a printable character.
>>> This line should read as: You mention in a later post that you consider a space as a non-printable character.
>>
>> No, I don't think of it as a character at all, when we are talking about typeset output (as opposed to ASCII (or Unicode) input).
>
> This is fine when all that you require of your output is that it be visible on a printed page. But modern communication media go much beyond that. A machine needs to be able to tell where words and lines end, to reflow paragraphs when appropriate, and to produce a flat extraction of all the text, perhaps also with some indication of the purpose of that text (e.g. by structural tagging).
>
> In short, what is output for one format should also be able to serve as input for another. Thus the space certainly does play the role of an output character, though the presence of a gap in the positioning of visible letters may serve this role in many, but not all, circumstances.
>
>> Clearly it is a character on input, but unless it generates a glyph in the output stream (which TeX does not, for normal spaces) then it is not a character (/qua/ character) on output but rather a formatting instruction not dissimilar to (say) end-of-line.
>
> But a formatting instruction for one program cannot serve as reliable input for another. A heuristic is then needed, to attempt to infer that a programming instruction must have been used, and to guess what kind of instruction it might have been. This is not 100% reliable, so it is deprecated in modern methods of data storage and document formats. XML-based formats use tagging rather than programming instructions. This is the modern way, which is used extensively for communicating data between different software systems.

Yes, that's the point. The goal of TeX is nice typographical appearance. The goal of XML is easy data exchange. If I want to send structured data, I send XML, not PDF.

>> ** Phil.
>
> TeX's strength is in its superior ability to position characters on the page for maximum visual effect. This is done by producing detailed programming instructions within the content stream of the PDF output. However, this is not enough to meet the needs of formats such as EPUB, non-visual reading software, archival formats, searchability, and other needs.
>
> Tagged PDF can be viewed as Adobe's response to these requirements, as an extension of the visual aspects of the PDF format. It is a direction in which TeX can (and surely must) move, to stay relevant within the publishing industry of the future.
>
> Hope this helps, Ross

No, it does not help. Remember that the last (almost) portable version of PDF is 1.2. If you open a tagged PDF, or even a PDF with a toUnicode map or a colorspace other than RGB or CMYK, in Acrobat Reader 3, it displays a fatal error and dies. I reported it to Adobe in March 2001 and they did nothing. I even reported another fatal bug in January 2001. I sent sample files but nothing happened; Adobe just stopped development of Acrobat Reader at the buggy version 3 for some operating systems.

Why do you rely so much on Adobe? When exchanging structured documents I will always do it in XML and never create tagged PDF, because I know that some users will be unable to read them with Adobe Acrobat Reader. I do not wish to make them dependent on ghostscript and similar tools.

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
Hello Zdenek,

On 18/11/2011, at 7:49 AM, Zdenek Wagner wrote:

>> But a formatting instruction for one program cannot serve as reliable input for another. A heuristic is then needed, to attempt to infer that a programming instruction must have been used, and to guess what kind of instruction it might have been. This is not 100% reliable, so it is deprecated in modern methods of data storage and document formats. XML-based formats use tagging rather than programming instructions. This is the modern way, which is used extensively for communicating data between different software systems.
>
> Yes, that's the point. The goal of TeX is nice typographical appearance. The goal of XML is easy data exchange. If I want to send structured data, I send XML, not PDF.

These days people want both.

>>> ** Phil.
>>
>> TeX's strength is in its superior ability to position characters on the page for maximum visual effect. This is done by producing detailed programming instructions within the content stream of the PDF output. However, this is not enough to meet the needs of formats such as EPUB, non-visual reading software, archival formats, searchability, and other needs.
>>
>> Tagged PDF can be viewed as Adobe's response to these requirements, as an extension of the visual aspects of the PDF format. It is a direction in which TeX can (and surely must) move, to stay relevant within the publishing industry of the future.
>>
>> Hope this helps, Ross
>
> No, it does not help. Remember that the last (almost) portable version of PDF is 1.2. If you open a tagged PDF, or even a PDF with a toUnicode map or a colorspace other than RGB or CMYK, in Acrobat Reader 3, it displays a fatal error and dies. I reported it to Adobe in March 2001 and they did nothing.

What else would you expect? AR is at version 10 now. On Linux it is at version 9 now, indeed 9.4.6 is current.

You don't expect TeX formats prior to TeX3 to handle non-ASCII characters, so why would you expect other people's older software versions to handle documents written for later formats?

> I even reported another fatal bug in January 2001. I sent sample files but nothing happened; Adobe just stopped development of Acrobat Reader at the buggy version 3 for some operating systems.

Why should they support OSs that have a limited lifetime? Industry moves on. A new computer is very cheap these days, with software that can do things your older one never could do.

By all means keep the old one while it still does useful work, but get another to do the things that the older one cannot handle.

> Why do you rely so much on Adobe? When exchanging structured documents I will always do it in XML and never create tagged PDF because ...

PDF, as a published standard, is not maintained by Adobe itself these days, yet Adobe continues to provide a free reader, at least for the visual aspects. That makes documents in PDF viewable by everyone (who is only interested in the visual aspect).

It is an ISO standard, which publishers will want to use. Most of the people who use (La)TeX are academics or others who need to do a fair amount of publishing, of one kind or another.

TeX can be modified to become capable of producing Tagged PDF. (See the videos of my talks.)

Free software (Poppler) is being developed to handle most aspects of PDF content, though it hasn't yet progressed enough to support structure tagging. It's surely on the list of things to do.

> ... I know that some users will be unable to read them with Adobe Acrobat Reader.

Why not? It is not Adobe Reader that is holding them back.

> I do not wish to make them dependent on ghostscript and similar tools.

You'll have to give some more details of who you are referring to here, and why their economic circumstances require them to have access to XML-transmitted data, but preclude them from access to other kinds of standard computing software and devices.

> --
> Zdeněk Wagner
> http://hroch486.icpf.cas.cz/wagner/
> http://icebearsoft.euweb.cz

Hope this helps,
Ross

Ross Moore ross.mo...@mq.edu.au
Mathematics Department, office: E7A-419
Macquarie University, tel: +61 (0)2 9850 8955
Sydney, Australia 2109, fax: +61 (0)2 9850 8114
Re: [XeTeX] Whitespace in input
Hi Phil,

On 18/11/2011, at 6:56 AM, Philip TAYLOR wrote:

> Ross, I do not dispute your arguments: I was answering Keith's question in an honest way. I (personally) do not think of a space in TeX output as a character at all, because I am steeped in TeX philosophy; but I am quite willing to accept that /if/ the objective is not to produce output for the sake of output, but output for subsequent processing as input by another program, then there /may/ be an argument for outputting a space as a variable-width glyph.
>
> However, I do think that what appears in the output stream is a secondary consideration; far more important (IMHO) is how we represent that space /within XeTeX/.

Do you realise how XeTeX works? Especially when handling non-Latin-based languages?

Essentially it does *nothing at all* after macro expansion. Instead it passes strings of characters (tokens are converted back to characters) to an external process, namely the font-handling aspects provided by the computer's operating system, or other software. What returns is a piece of PDF output, along with the height/depth/width of this piece (i.e. a TeX-like box).

It is external software that has been designed to encode the knowledge of how the particular language script is structured. This makes all the detailed decisions about character placement, perhaps using information contained within the font itself. Indeed for many fonts there are no such decisions, since the font actually does it itself. All that is needed is to place the character string in the most appropriate position on the page.

XeTeX does play a role in determining whether the box fits on the line being built. If not, then hyphenation points come into play, so that alternative break-ups of the character string into smaller pieces must be considered.

Why am I giving this detail of a description? ...

> There is, I am sure, no suggestion on the table that we start to treat a conventional space in XeTeX other than as TeX has traditionally treated it, and therefore the real question is (to my mind), do we adopt an extension of this traditional TeX treatment for non-breaking space, thin space, and any of the other not-quite-standard spaces that Unicode encompasses, ...

Well, what if those not-quite-standard space characters actually play a vital role in the layout of a language script? Indeed some of them do. For instance, other threads on this XeTeX list are talking about ZWJ and ZWNJ, and I've already mentioned things like the LTR and RTL indicators.

Almost certainly many of the other characters are already handled specially by the OS software that XeTeX passes the main decisions to. So changing this at input level for XeTeX could completely change the visual appearance of the output, in ways that TeX software has no way to fix.

In other terms, those extra space characters are programming instructions for other, non-TeX-based software. XeTeX needs to pass them on unchanged, if that software is to give back to XeTeX the high-quality typeset building blocks that it needs to position on the page.

By accepting Unicode input, and passing it along to other software, TeX has inherited the ability to handle many, many more languages and scripts than it ever could do properly before. This is as well as making a much richer set of fonts available for use in XeTeX-produced PDFs. It does these things by piggy-backing on the work of others, developed by people who might have absolutely no idea of what TeX is, nor how it works, and probably would not care even if they did. It is a win-win all round, something that is very rare these days.

But this does come with a price. It means that XeTeX-produced output can be OS-dependent, unlike with other TeX software! Also, successful compilation to the desired output can be dependent on having the correct version of a font installed. Many posts on the XeTeX list have been about such issues.

> or do we look for an alternative model which /might/ be glyph- or character-based?

My view is no, we should not, at least not as the default way that XeTeX handles its input.

By all means write packages that can be used in particular situations where such characters are producing observable unwanted effects on the final output. But this should be done at the package level (e.g. by a \catcode change, and macro definition). Then the source document will have a line in the preamble that indicates that there could be a deviation from default behaviours. This is an indication that there is something special about the source stream, and that someone with appropriate knowledge has worked out how to deal with it.

But for general (default) usage, the non-ASCII characters representing Unicode code-points that go in should be treated as exactly those Unicode code-points.

Alternatively, use the editor to change the unwanted characters to ordinary spaces, or whatever else works well with TeX processing.
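Before changing anything in an editor, it helps to know which of these characters a source file actually contains. The following is a minimal Python sketch, not part of any TeX distribution; the function name `find_special_characters` and the choice to report the Unicode categories Zs (space separators) and Cf (format characters such as ZWJ, ZWNJ and the LTR/RTL marks) are assumptions made for this example.

```python
import unicodedata


def find_special_characters(text):
    """List non-ASCII space (Zs) and format (Cf) characters with their positions."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == " ":
                continue  # the ordinary space is what TeX already expects
            if unicodedata.category(ch) in ("Zs", "Cf"):
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((lineno, col, f"U+{ord(ch):04X}", name))
    return hits


# A no-break space on line 1 and a zero width joiner on line 2.
sample = "a\u00a0b\nc\u200dd"
for hit in find_special_characters(sample):
    print(hit)
# (1, 2, 'U+00A0', 'NO-BREAK SPACE')
# (2, 2, 'U+200D', 'ZERO WIDTH JOINER')
```

A report like this only locates the characters; whether a given one should be replaced by an ordinary space or passed through unchanged still depends on the script and on the reasons given above.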
Re: [XeTeX] Whitespace in input
2011/11/17 Ross Moore ross.mo...@mq.edu.au:

> Hello Zdenek,
>
> On 18/11/2011, at 7:49 AM, Zdenek Wagner wrote:
>
>>> But a formatting instruction for one program cannot serve as reliable input for another. A heuristic is then needed, to attempt to infer that a programming instruction must have been used, and to guess what kind of instruction it might have been. This is not 100% reliable, so it is deprecated in modern methods of data storage and document formats. XML-based formats use tagging rather than programming instructions. This is the modern way, which is used extensively for communicating data between different software systems.
>>
>> Yes, that's the point. The goal of TeX is nice typographical appearance. The goal of XML is easy data exchange. If I want to send structured data, I send XML, not PDF.
>
> These days people want both.
>
>>>> ** Phil.
>>>
>>> TeX's strength is in its superior ability to position characters on the page for maximum visual effect. This is done by producing detailed programming instructions within the content stream of the PDF output. However, this is not enough to meet the needs of formats such as EPUB, non-visual reading software, archival formats, searchability, and other needs.
>>>
>>> Tagged PDF can be viewed as Adobe's response to these requirements, as an extension of the visual aspects of the PDF format. It is a direction in which TeX can (and surely must) move, to stay relevant within the publishing industry of the future.
>>>
>>> Hope this helps, Ross
>>
>> No, it does not help. Remember that the last (almost) portable version of PDF is 1.2. If you open a tagged PDF, or even a PDF with a toUnicode map or a colorspace other than RGB or CMYK, in Acrobat Reader 3, it displays a fatal error and dies. I reported it to Adobe in March 2001 and they did nothing.
>
> What else would you expect? AR is at version 10 now. On Linux it is at version 9 now, indeed 9.4.6 is current.

For OS/2 (now eComStation) the latest AR is at version 3, with known bugs not fixed.

> You don't expect TeX formats prior to TeX3 to handle non-ASCII characters, so why would you expect other people's older software versions to handle documents written for later formats?
>
>> I even reported another fatal bug in January 2001. I sent sample files but nothing happened; Adobe just stopped development of Acrobat Reader at the buggy version 3 for some operating systems.
>
> Why should they support OSs that have a limited lifetime? Industry moves on. A new computer is very cheap these days, with software that can do things your older one never could do.

Yes, since that time OS/2 has evolved from version 2 through 3, Warp Connect, 4, 4.5, eComStation 1.0 and eComStation 1.1 to eComStation 2.0, yet AR has remained at version 3.

> By all means keep the old one while it still does useful work, but get another to do the things that the older one cannot handle.

If I compare multitasking of OS/2 on my old Celeron 333 MHz with Linux running on a quad-core Intel 4.3 GHz, the winner is still OS/2. If I have a single thread in mind, 4.3 GHz is of course faster, but multitasking and multithreading are done much better in OS/2. A few years ago I made a comparison with a long numerical calculation on OS/2 (Celeron 333 MHz) and Windows XP (Intel 250 MHz). The program took 16 hours on OS/2, running an Apache server at the same time, and 240 hours on Windows, running only this program. I am not sure that I could find the very same program now, but judging from similar programs I would expect 6 hours on the quad-core 4.3 GHz with Linux. Are you surprised that I am not satisfied with progress in HW and OS?

>> Why do you rely so much on Adobe? When exchanging structured documents I will always do it in XML and never create tagged PDF because ...
>
> PDF, as a published standard, is not maintained by Adobe itself these days, yet Adobe continues to provide a free reader, at least for the visual aspects. That makes documents in PDF viewable by everyone (who is only interested in the visual aspect).
>
> It is an ISO standard, which publishers will want to use. Most of the people who use (La)TeX are academics or others who need to do a fair amount of publishing, of one kind or another.
>
> TeX can be modified to become capable of producing Tagged PDF. (See the videos of my talks.)
>
> Free software (Poppler) is being developed to handle most aspects of PDF content, though it hasn't yet progressed enough to support structure tagging. It's surely on the list of things to do.

Yes, it is good for extraction even on OS/2 (I do not know whether people have compiled poppler, but xpdf binaries are available).

>> ... I know that some users will be unable to read them with Adobe Acrobat Reader.
>
> Why not? It is not Adobe Reader that is holding them back.
>
>> I do not wish to make them dependent on ghostscript and similar tools.
>
> You'll have to give some more details of who you are referring to here, and why their economic circumstances require them to have access to
Re: [XeTeX] Whitespace in input
Hi Philip,

We are basically following the same lines. TeX is foremost a layout program based on standard printers' methodology, where the space character is white space and not a glyph. We do actually have to differentiate between the two in discussions.

The crux of the problem is in (Xe)TeX's parsing algorithm. I never liked it, and personally I have many problems with it.

regards
Keith.

On 17.11.2011 at 13:53, Philip TAYLOR wrote:

> Keith J. Schultz wrote:
>> On 17.11.2011 at 11:26, Keith J. Schultz wrote:
>>> O.K. You mention in a later post that you do consider a space as a printable character.
>> This line should read as: You mention in a later post that you consider a space as a non-printable character.
>
> No, I don't think of it as a character at all, when we are talking about typeset output (as opposed to ASCII (or Unicode) input). Clearly it is a character on input, but unless it generates a glyph in the output stream (which TeX does not, for normal spaces) then it is not a character (/qua/ character) on output but rather a formatting instruction not dissimilar to (say) end-of-line.
Re: [XeTeX] Whitespace in input
Keith J. Schultz wrote:

> The crux of the problem is in (Xe)TeX's parsing algorithm. I never liked it, and personally I have many problems with it.

Is this XeTeX-specific, Keith, or do you also dislike TeX's parsing algorithm? And what is it that you dislike, and how would you propose that it be improved?

** Phil.
Re: [XeTeX] Whitespace in input
Hi Tobias,

On 14.11.2011 at 18:42, Tobias Schoel wrote:

> On 14.11.2011 18:30, msk...@ansuz.sooke.bc.ca wrote:
>> [snip, snip]
>
> Now we come to the trouble of Unicode specifying a line-breaking algorithm ( http://www.unicode.org/reports/tr14/tr14-26.html ), which probably isn't exactly TeX's. I'm not into these algorithms, so I can't compare. But I would ask some Master of this Art to speak up about this conflict.

I went and briefly looked at the annex. At the beginning it states that the annexes are not necessarily a requirement unless mentioned in the standard! I did not check the standard, but as you read on, the description of the LBA (line-breaking algorithm) is not mandatory at all. Furthermore, it more or less describes which characters are directly involved in line breaking (top of table 1). The rest is just a suggestion of how one might go about achieving line breaking. This is not a standard at all.

Since TeX has its own line-breaking algorithms, we need not concern ourselves with the content of this annex as far as Unicode is concerned.

What you should be aware of is that the LBA is intended as an aid for a preprocessor to a more elaborate line-breaking algorithm. It has been approved for printing, but nowhere does it state that it must be followed, nor that it is complete. In other words, it is merely a suggestion.

There is no conflict per se, just another way of dealing with line breaking. There is no real standard for line breaking. It is more or less a matter of taste, style and aesthetics. (Yes, there are many conventions that should be observed, and many are grammatical in nature.)

regards
Keith.
Re: [XeTeX] Whitespace in input
On Tue, Nov 15, 2011 at 2:27 AM, Keith J. Schultz keithjschu...@web.de wrote:

> Hi all,
>
> I agree that XeTeX should support all printable characters. Given your definition I would say all visible printed characters. Invisible characters are a problem in a programming language. A non-breaking space is to me a printable character, in so far as it is important and must be used to distinguish between word spaces, et al.

As long as this is an option which defaults to off, again I have no problem with this. I mean, by this definition carriage returns and line feeds are also printable characters, and these are supported by options which have to be turned on rather than being on by default.

> To go back in history, one of my pet peeves in LaTeX was that I had to enter the German characters öäüß as \"o, \"a, etc. and later the shortcut forms "s, "u, etc. Later, with inputenc, I could finally just enter öäüß. But I had trouble with (actually, just needed to convert) my files going to and from Apple to Windows (so that editing was possible on Windows). Yet I still had trouble with quoting, so I was forced to use \quote, et al. to have a simple method of quoting properly in English, German and French in one document! I even modified them to suit some requirements I had, and I had one command.
>
> Unicode has thankfully changed all this. I can forget about using all those TeX commands for the characters I need. I just type away. The only problem now is the keyboard equivalents and how the editor of choice displays them.

But here you have a problem. An editor can display a non-breaking space by its semantic value (i.e. with a special glyph), but this is not without problems. For example, we could also display line feeds as the paragraph symbol, but that is also U+00B6, so now you have ambiguity issues: is it a Unicode character or is it a line feed? Or you can color-code, but this is problematic for a large number of other reasons.

So I am not sure these are simple problems that admit of simple solutions.

My recommendation is:

1) Default to handling all white space as it exists now.
2) Provide some sort of switch, whether to the execution of XeTeX or to the document itself, to turn on handling of special Unicode characters.
3) If that switch is enabled, then treat the whitespaces according to their Unicode meanings. If not, treat them as standard whitespace.

The advantage of this approach is that people who don't want to worry about what sort of whitespace is in the text files they are inputting don't have to worry about it, and that those who do have an easy way of determining whether a layout issue is caused by non-breaking spaces.

Best Wishes,
Chris Travers
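The three-point recommendation above can be sketched as a preprocessing step in Python. This is only an illustration of the switch idea, applied to the text before it reaches TeX; it does not describe an existing XeTeX option, and the function name `prepare_input` and its `unicode_spaces` flag are hypothetical. "Whitespace" is taken here to mean the Unicode category Zs (space separators), which is itself an assumption about where the line should be drawn.

```python
import unicodedata


def prepare_input(text, unicode_spaces=False):
    """Fold every Unicode space separator to U+0020 unless the switch is on."""
    if unicode_spaces:
        return text  # switch on: pass the characters through untouched
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )


# A no-break space and two thin spaces in an otherwise ASCII string.
source = "10\u00a0km\u2009/\u2009h"
print(repr(prepare_input(source)))                           # '10 km / h'
print(prepare_input(source, unicode_spaces=True) == source)  # True
```

With the switch off, the distinction between the kinds of space is thrown away, which is the "don't have to worry about it" default; with it on, whatever reads the text next has to give each character its own meaning.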
Re: [XeTeX] Whitespace in input
Keith J. Schultz wrote:

> A non-breaking space is to me a printable character, in so far as it is important and must be used to distinguish between word spaces, et al.

If, for you, [a] non-breaking space is a printable character, then presumably that character must be taken from some font. If you take a character from a font, it will have a size, and although it can be combined with kerning rules to adjust its position w.r.t. adjacent characters, the logic for this is fairly restricted. In particular, it cannot take into account the amount by which TeX is seeking to expand or contract spaces on the current line in order to achieve optimal paragraphs.

So in your model of the ideal universe, non-breaking Unicode spaces would not behave as do conventional TeX non-breaking spaces (which /do/ expand and contract to assist in TeX's line-breaking), nor would they conform to their Unicode definition, where their decomposition is defined as:

    <noBreak> SPACE (U+0020)

I wonder if you would like to discuss these points?

Philip Taylor
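The decomposition Phil refers to can be inspected with Python's standard `unicodedata` module, which exposes the mapping recorded in the Unicode Character Database. This is a small sketch for checking the data, nothing more; the particular characters listed are just examples chosen here.

```python
import unicodedata

# Print the decomposition mapping recorded for a few special spaces.
for ch in ("\u00a0", "\u2009", "\u202f"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->",
          unicodedata.decomposition(ch))

# Compatibility normalization applies these mappings, so the
# no-break property is lost along with the distinct code point.
print(unicodedata.normalize("NFKC", "a\u00a0b") == "a b")  # True
```

For U+00A0 the mapping reads `<noBreak> 0020`: an ordinary space plus a tag saying that no line break is allowed, with nothing in it about a glyph or a fixed width.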
Re: [XeTeX] Whitespace in input
On 11/15/2011 5:39 AM, Chris Travers wrote:

> My recommendation is:
> 1) Default to handling all white space as it exists now.
> 2) Provide some sort of switch, whether to the execution of XeTeX or to the document itself, to turn on handling of special Unicode characters.
> 3) If that switch is enabled, then treat the whitespaces according to their Unicode meanings. If not, treat them as standard whitespace.

I think you asked me earlier whether that would satisfy me, and I failed to answer. Yes, it would.

--
Mike Maxwell
maxw...@umiacs.umd.edu
My definition of an interesting universe is one that has the capacity to study itself. --Stephen Eastmond
Re: [XeTeX] Whitespace in input
2011/11/15 Mike Maxwell maxw...@umiacs.umd.edu:

> On 11/15/2011 5:39 AM, Chris Travers wrote:
>> My recommendation is:
>> 1) Default to handling all white space as it exists now.
>> 2) Provide some sort of switch, whether to the execution of XeTeX or to the document itself, to turn on handling of special Unicode characters.
>> 3) If that switch is enabled, then treat the whitespaces according to their Unicode meanings. If not, treat them as standard whitespace.
>
> I think you asked me earlier whether that would satisfy me, and I failed to answer. Yes, it would.

But such a solution is not clean: you cannot plug such logic into the TeX mouth when the input is being read, nor into the output stage when TECkit maps are in effect. I wrote the reasons earlier. The only reasonable solution seems to be the one suggested by Phil Taylor: to extend \catcode up to 255 and assign special categories to other types of characters. Thus we could say that normal space is 10, non-breakable space is 16, thin space is 17, etc. XeTeX will then be able to treat them properly.

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
2011/11/15 Zdenek Wagner zdenek.wag...@gmail.com:

> 2011/11/15 Mike Maxwell maxw...@umiacs.umd.edu:
>> On 11/15/2011 5:39 AM, Chris Travers wrote:
>>> My recommendation is:
>>> 1) Default to handling all white space as it exists now.
>>> 2) Provide some sort of switch, whether to the execution of XeTeX or to the document itself, to turn on handling of special Unicode characters.
>>> 3) If that switch is enabled, then treat the whitespaces according to their Unicode meanings. If not, treat them as standard whitespace.
>>
>> I think you asked me earlier whether that would satisfy me, and I failed to answer. Yes, it would.
>
> But such a solution is not clean: you cannot plug such logic into the TeX mouth when the input is being read, nor into the output stage when TECkit maps are in effect. I wrote the reasons earlier. The only reasonable solution seems to be the one suggested by Phil Taylor: to extend \catcode up to 255 and assign special categories to other types of characters. Thus we could say that normal space is 10, non-breakable space is 16, thin space is 17, etc. XeTeX will then be able to treat them properly.

But we are talking about two different things here. The first is user interface, and the second is mechanism.

What I am saying is that special handling of this sort should have to be enabled somehow by the user. I don't really care how. It could be by a command-line switch to xelatex. It could be by a call in the document, if that's possible. It should be optional, and disabled by default, given that the characters involved are not intended to be displayed with glyphs.

Best Wishes,
Chris Travers
Re: [XeTeX] Whitespace in input
Zdenek Wagner wrote:

> The only reasonable solution seems to be the one suggested by Phil Taylor: to extend \catcode up to 255 and assign special categories to other types of characters. Thus we could say that normal space is 10, non-breakable space is 16, thin space is 17, etc. XeTeX will then be able to treat them properly.

which may, unfortunately, then require new types of node in TeX's internal list structures ... (may, not will).

** Phil.
Re: [XeTeX] Whitespace in input
Chris Travers wrote:

> But we are talking about two different things here. The first is user interface, and the second is mechanism.
>
> What I am saying is that special handling of this sort should have to be enabled somehow by the user. I don't really care how. It could be by a command-line switch to xelatex. It could be by a call in the document, if that's possible. It should be optional, and disabled by default, given that the characters involved are not intended to be displayed with glyphs.

But /if/ it requires a change to the number of category codes (and/or the creation of one or more classes of internal node), then this is not something that should be capable of being turned on or off within a document. I don't have any problem with the idea of turning the functionality on or off either within a format file or from a command-line qualifier.

** Phil.
Re: [XeTeX] Whitespace in input
2011/11/15 Chris Travers chris.trav...@gmail.com:
> 2011/11/15 Zdenek Wagner zdenek.wag...@gmail.com:
>> 2011/11/15 Mike Maxwell maxw...@umiacs.umd.edu:
>>> On 11/15/2011 5:39 AM, Chris Travers wrote:
>>>> My recommendation is:
>>>> 1) Default to handling all white space as it exists now.
>>>> 2) Provide some sort of switch, whether to the execution of XeTeX or to the document itself, to turn on handling of special Unicode characters.
>>>> 3) If that switch is enabled, then treat the whitespaces according to Unicode meanings. If not, treat them as standard whitespace.
>>>
>>> I think you asked me earlier whether that would satisfy me, and I failed to answer. Yes, it would.
>>
>> But such a solution is not clean, you cannot plug in such logic to the TeX mouth when the input is being read nor to the output stage when TECkit maps are in effect. I wrote the reasons earlier. The only reasonable solution seems to be the one suggested by Phil Taylor, to extend \catcode up to 255 and assign special categories to other types of characters. Thus we could say that normal space is 10, nonbreakable space is 16, thin space is 17 etc. XeTeX will then be able to treat them properly.
>
> But we are talking two different things here. The first is user interface, and the second is mechanism. What I am saying is special handling of this sort should be required to be enabled somehow by the user. I don't really care how. It could be by a command-line switch to xelatex. It could be by a call in the document if that's possible. It should be optional, and disabled by default, given that the characters involved are not intended to be displayed with glyphs.

The mechanism is simple: set the character's \catcode to 13 (active) and define it as \nobreak\space. If you wish to make it clever in all XeLaTeX corners, find one of my previous posts to see what has to be taken into account. It may be present in a package called nbsp.sty or so. No change in XeTeX is needed if you do it this way.

> Best Wishes,
> Chris Travers

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
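A minimal sketch of the mechanism described above, assuming XeTeX with UTF-8 input and a format that defines \active, \nobreak and \space (plain TeX and LaTeX both do). The ^^^^00a0 notation is XeTeX's ASCII spelling of U+00A0. The use of \protected is an assumption added here so that the active character is not expanded inside \edef or \write; it is not part of the post, and the corner cases the post alludes to are not handled.

```tex
% Make U+00A0 (NO-BREAK SPACE) an active character (category code 13).
\catcode"00A0=\active
% ^^^^00a0 is read as the single character U+00A0, which is now active,
% so this defines what that character does when it is met in the input.
\protected\def^^^^00a0{\nobreak\space}

% Usage: the tie below is U+00A0 written in ^^^^ notation; typing the
% literal character in a UTF-8 source file is equivalent.
Dr.^^^^00a0Knuth
```

Because the definition lives entirely at the macro level, it can be placed in a package or preamble and removed again without touching the engine.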
Re: [XeTeX] Whitespace in input
2011/11/15 Philip TAYLOR p.tay...@rhul.ac.uk:
> Zdenek Wagner wrote:
>> The only reasonable solution seems to be the one suggested by Phil Taylor, to extend \catcode up to 255 and assign special categories to other types of characters. Thus we could say that normal space is 10, nonbreakable space is 16, thin space is 17 etc. XeTeX will then be able to treat them properly.
>
> which may, unfortunately, then require new types of node in TeX's internal list structures ... (may, not will).

Sure, the change will not be trivial. I do not know how the category codes are stored internally, but extending them from 16 possible values to 256 may require a dramatic change in the internal structures.

> ** Phil.

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
2011/11/15 Philip TAYLOR p.tay...@rhul.ac.uk:
> Chris Travers wrote:
>> But we are talking two different things here. The first is user interface, and the second is mechanism. What I am saying is special handling of this sort should be required to be enabled somehow by the user. I don't really care how. It could be by a command-line switch to xelatex. It could be by a call in the document if that's possible. It should be optional, and disabled by default, given that the characters involved are not intended to be displayed with glyphs.
>
> But /if/ it requires a change to the number of category codes (and/or the creation of one or more classes of internal node), then this is not something that should be capable of being turned on or off within a document. I don't have any problem with the idea of turning the functionality on or off either within a format file or from a command-line qualifier.

If you know what such characters are (and it will certainly be documented), you just set their categories back to 12 in order to get the old behaviour.

> ** Phil.

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
Zdenek Wagner wrote:
> If you know what such characters are (and it will certainly be documented), you just set their categories back to 12 in order to get the old behaviour.

No ! A catcode is for life, not just for Christmas ! Once a character has been read, and bound into a character/catcode pair, that catcode remains immutable. That means that code that is /not/ expecting to have to deal with non-standard catcodes could none the less be passed token lists containing such entities if it is possible, within a document, to turn such a feature on and off again.

** Phil.
Re: [XeTeX] Whitespace in input
On Tue, Nov 15, 2011 at 02:20:17PM +, Philip TAYLOR wrote:
> No ! A catcode is for life, not just for Christmas ! Once a character has been read, and bound into a character/catcode pair, that catcode remains immutable.

Do you mean that as a general good practice in TeX programming, or as a description of how TeX works? The latter is obviously wrong.

Arthur
Re: [XeTeX] Whitespace in input
2011/11/15 Philip TAYLOR p.tay...@rhul.ac.uk:
> Zdenek Wagner wrote:
>> If you know what such characters are (and it will certainly be documented), you just set their categories back to 12 in order to get the old behaviour.
>
> No ! A catcode is for life, not just for Christmas ! Once a character has been read, and bound into a character/catcode pair, that catcode remains immutable. That means that code that is /not/ expecting to have to deal with non-standard catcodes could none the less be passed token lists containing such entities if it is possible, within a document, to turn such a feature on and off again.

Of course, I know it. What I meant was that you could set the \catcode of all these extended characters to 12 at the beginning of your document. Thus you get the same behaviour as now.

> ** Phil.

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
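The opt-out described above could look roughly like this. The extended category codes (16, 17, ...) are only a proposal in this thread, not an existing XeTeX feature, so this sketch shows just the resetting step: early in the document each affected character is given category 12 ("other"). The particular list of characters is an assumption chosen for illustration.

```tex
% Hypothetical opt-out, placed at the very start of the document:
% give the special Unicode spaces category 12 ("other"), so that
% they are handled as ordinary characters of the current font.
\catcode"00A0=12 % NO-BREAK SPACE
\catcode"2009=12 % THIN SPACE
\catcode"202F=12 % NARROW NO-BREAK SPACE
```

Since the assignments happen before any text is read, no token ever carries one of the new category codes, which is the situation Phil says he has no problem with.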
Re: [XeTeX] Whitespace in input
Arthur Reutenauer wrote:
> On Tue, Nov 15, 2011 at 02:20:17PM +, Philip TAYLOR wrote:
>> No ! A catcode is for life, not just for Christmas ! Once a character has been read, and bound into a character/catcode pair, that catcode remains immutable.
>
> Do you mean that as a general good practice in TeX programming, or as a description of how TeX works? The latter is obviously wrong.

The latter is what the TeXbook says (P.~39) : "Once a category code has been attached to a character token, the attachment is permanent."

** Phil.
Re: [XeTeX] Whitespace in input
On Nov 15, 2011, at 8:52 AM, Philip TAYLOR wrote:
> Arthur Reutenauer wrote:
>> On Tue, Nov 15, 2011 at 02:20:17PM +, Philip TAYLOR wrote:
>>> No ! A catcode is for life, not just for Christmas ! Once a character has been read, and bound into a character/catcode pair, that catcode remains immutable.
>>
>> Do you mean that as a general good practice in TeX programming, or as a description of how TeX works? The latter is obviously wrong.
>
> The latter is what the TeXbook says (P.~39) : "Once a category code has been attached to a character token, the attachment is permanent."
>
> ** Phil.

Howdy,

What happens in a verbatim environment?

Good Luck,
Herb Schulz (herbs at wideopenwest dot com)
Re: [XeTeX] Whitespace in input
Zdenek Wagner wrote:
> Of course, I know it. What I meant was that you could set the \catcode of all these extended characters to 12 at the beginning of your document. Thus you get the same behaviour as now.

Ah yes : with that, I have no problem.

** Phil.
Re: [XeTeX] Whitespace in input
2011/11/15 Herbert Schulz he...@wideopenwest.com:
> On Nov 15, 2011, at 8:52 AM, Philip TAYLOR wrote:
>> Arthur Reutenauer wrote:
>>> On Tue, Nov 15, 2011 at 02:20:17PM +, Philip TAYLOR wrote:
>>>> No ! A catcode is for life, not just for Christmas ! Once a character has been read, and bound into a character/catcode pair, that catcode remains immutable.
>>>
>>> Do you mean that as a general good practice in TeX programming, or as a description of how TeX works? The latter is obviously wrong.
>>
>> The latter is what the TeXbook says (P.~39) : "Once a category code has been attached to a character token, the attachment is permanent."
>>
>> ** Phil.
>
> Howdy,
>
> What happens in a verbatim environment?

It will have to be redefined; there will just be additional special characters that have to be handled. \XeTeXrevision will tell you whether the extended \catcode is implemented.

> Good Luck,
> Herb Schulz (herbs at wideopenwest dot com)

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
> The latter is what the TeXbook says (P.~39) : "Once a category code has been attached to a character token, the attachment is permanent."

Yes, because you meant individual tokens (which I understood in retrospect). But in the context of the discussion, you really seemed to be saying that you could not change the \catcodes of characters to be read, which was the point (not that there is much point left to the whole thread any more...)

Arthur
Re: [XeTeX] Whitespace in input
Arthur Reutenauer wrote:
>> The latter is what the TeXbook says (P.~39) : "Once a category code has been attached to a character token, the attachment is permanent."
>
> Yes, because you meant individual tokens (which I understood in retrospect). But in the context of the discussion, you really seemed to be saying that you could not change the \catcodes of characters to be read, which was the point (not that there is much point left to the whole thread any more...)

No no : changing catcodes on the fly is standard TeX programming; what we should not contemplate is changing the /number/ of catcodes on the fly ...

** Phil.
Re: [XeTeX] Whitespace in input
Herbert Schulz wrote:
>> The latter is what the TeXbook says (P.~39) : "Once a category code has been attached to a character token, the attachment is permanent."
>>
>> ** Phil.
>
> What happens in a verbatim environment?

The verbatim environment sets up an environment within which characters that have not yet been seen by TeX's mouth receive category codes that potentially differ from the category code that would normally be associated with that character. Once the category code has been bound to a particular instance of that character, that instance never changes its catcode.

** Phil.
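The distinction drawn above, between changing \catcode for characters still to be read and the fixed catcode of a token already read, can be checked with a small plain TeX sketch. The macro names \early and \late are made up for this illustration.

```tex
% Plain TeX; works with any engine.
\def\early{$}      % this $ is tokenized now, with catcode 3 (math shift)
\catcode`\$=12     % $ characters read from here on get catcode 12 (other)
\def\late{$}       % this $ is tokenized with catcode 12
\catcode`\$=3      % restore the usual setting for the rest of the input

% \ifx compares the stored replacement texts, catcodes included.
% The $ inside \early kept catcode 3, so the two macros differ.
\ifx\early\late \message{same}\else \message{different}\fi
\bye
```

Restoring \catcode of $ to 3 has no effect on the token already stored in \late, which is the behaviour a verbatim environment relies on.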
Re: [XeTeX] Whitespace in input
On Nov 15, 2011, at 11:19 AM, Philip TAYLOR wrote:
> Herbert Schulz wrote:
>>> The latter is what the TeXbook says (P.~39) : "Once a category code has been attached to a character token, the attachment is permanent."
>>>
>>> ** Phil.
>>
>> What happens in a verbatim environment?
>
> The verbatim environment sets up an environment within which characters that have not yet been seen by TeX's mouth receive category codes that potentially differ from the category code that would normally be associated with that character. Once the category code has been bound to a particular instance of that character, that instance never changes its catcode.
>
> ** Phil.

Howdy,

So what you are saying is not that you can't control the catcode of a particular character but that you can't change it after it is set and in TeX's ``stomach.'' That I can agree with.

Good Luck,
Herb Schulz (herbs at wideopenwest dot com)
Re: [XeTeX] Whitespace in input
On Nov 15, 2011, at 11:11 AM, Herbert Schulz wrote:
> On Nov 15, 2011, at 11:19 AM, Philip TAYLOR wrote:
>> Herbert Schulz wrote:
>>>> The latter is what the TeXbook says (P.~39) : "Once a category code has been attached to a character token, the attachment is permanent."
>>>>
>>>> ** Phil.
>>>
>>> What happens in a verbatim environment?
>>
>> The verbatim environment sets up an environment within which characters that have not yet been seen by TeX's mouth receive category codes that potentially differ from the category code that would normally be associated with that character. Once the category code has been bound to a particular instance of that character, that instance never changes its catcode.
>>
>> ** Phil.
>
> Howdy,
>
> So what you are saying is not that you can't control the catcode of a particular character but that you can't change it after it is set and in TeX's ``stomach.'' That I can agree with.
>
> Good Luck,
> Herb Schulz (herbs at wideopenwest dot com)

Howdy,

What I meant to say was...

So what you are saying is not that you can control the catcode of a particular character but that you can't change it after it is set and in TeX's ``stomach.'' That I can agree with.

(notice the "can't control" --- "can control")

Good Luck,
Herb Schulz (herbs at wideopenwest dot com)
Re: [XeTeX] Whitespace in input
I think it made more sense with "can't", Herb, but that could be a trans-Atlantic difference of usage -- you would, I think, say "I could care less" where I would say "I couldn't care less".

** Phil.

Herbert Schulz wrote:
> What I meant to say was...
>
> So what you are saying is not that you can control the catcode of a particular character but that you can't change it after it is set and in TeX's ``stomach.'' That I can agree with.
>
> (notice the "can't control" --- "can control")
Re: [XeTeX] Whitespace in input
On Nov 15, 2011, at 2:43 PM, Ross Moore wrote:
> On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
>> Given that TeX (and XeTeX too) deal with a non-breakable space already (where we usually use the ~ to represent that space) it seems to me that XeTeX should treat that the same way.
>
> No, I disagree completely. What if you really want the U+00A0 character to be in the PDF? That is, when you copy/paste from the PDF, you want that character to come along for the ride.
>
> In TeX ~ *simulates* a non-breaking space visually, but there is no actual character inserted. If you want the character you have to ensure that it gets there, and what more natural way is there than to put it in explicitly. This is how XeTeX treats it currently, according to my experiments, using just fontspec and Charis SIL font. Anyone who has a different experience should check what other packages and fonts are being loaded, and whether there is something that specifically changes how that character is handled.

Howdy,

But isn't that also true about a regular space character? Doesn't (Xe)TeX insert some glue rather than a Space Character?

>> The big puzzle will happen when someone, not using an editor capable of displaying invisibles, can't understand why they can't get XeTeX to break between the two words.
>
> That is an editor problem, not one that XeTeX itself should be concerned with.

Agreed. But I'll bet you end up with lots of questions on ctt/texhax/etc. about line breaking; assuming that the non-breaking space actually does its ``job.''

> Now having U+00A0 between two words may change the way hyphenation works for those words. But surely if you are wanting to inhibit a line-break between words, you probably also don't want either word to be hyphenated. So this could really be the correct thing. or not. :-)

Good Luck,
Herb Schulz (herbs at wideopenwest dot com)
Re: [XeTeX] Whitespace in input
2011/11/15 Ross Moore ross.mo...@mq.edu.au:
> On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
>> Given that TeX (and XeTeX too) deal with a non-breakable space already (where we usually use the ~ to represent that space) it seems to me that XeTeX should treat that the same way.
>
> No, I disagree completely. What if you really want the U+00A0 character to be in the PDF? That is, when you copy/paste from the PDF, you want that character to come along for the ride.

From the typographical point of view it is the worst of all possible methods. If you really wish it, then do not use TeX but M$ Word or OpenOffice. M$ Word automatically inserts nonbreakable spaces at some points in text written in Czech. As far as grammar is concerned, it is correct. However, U+00A0 is fixed width. If you look at the output, the nonbreakable spaces are too wide on some lines and too thin on other lines. I cannot imagine anything uglier.

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
Hi Zdenek,

On 16/11/2011, at 8:58 AM, Zdenek Wagner wrote:
> 2011/11/15 Ross Moore ross.mo...@mq.edu.au:
>> On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
>>> Given that TeX (and XeTeX too) deal with a non-breakable space already (where we usually use the ~ to represent that space) it seems to me that XeTeX should treat that the same way.
>>
>> No, I disagree completely. What if you really want the U+00A0 character to be in the PDF? That is, when you copy/paste from the PDF, you want that character to come along for the ride.
>
> From the typographical point of view it is the worst of all possible methods. If you really wish it,

The *really wish it* is the choice of the author, not the software.

> then do not use TeX but M$ Word or OpenOffice. M$ Word automatically inserts nonbreakable spaces at some points in text written in Czech. As far as grammar is concerned, it is correct. However, U+00A0 is fixed width. If you look at the output, the nonbreakable spaces are too wide on some lines and too thin on other lines. I cannot imagine anything uglier.

I do not disagree with you that this could be ugly. But that is not the point. If you want superior aesthetic typesetting, with nice choices for hyphenation, then don't use U+00A0. Of course!

Whatever the reason for wanting to use this character, there should be a straightforward way to do it. Using the character itself:
 a. is the most understandable
 b. currently works
 c. requires no special explanation.

Cheers,
Ross

Ross Moore ross.mo...@mq.edu.au
Mathematics Department, Macquarie University, Sydney, Australia 2109
office: E7A-419  tel: +61 (0)2 9850 8955  fax: +61 (0)2 9850 8114
Re: [XeTeX] Whitespace in input
2011/11/15 Ross Moore ross.mo...@mq.edu.au:
> Hi Zdenek,
>
> On 16/11/2011, at 8:58 AM, Zdenek Wagner wrote:
>> 2011/11/15 Ross Moore ross.mo...@mq.edu.au:
>>> On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
>>>> Given that TeX (and XeTeX too) deal with a non-breakable space already (where we usually use the ~ to represent that space) it seems to me that XeTeX should treat that the same way.
>>>
>>> No, I disagree completely. What if you really want the U+00A0 character to be in the PDF? That is, when you copy/paste from the PDF, you want that character to come along for the ride.
>>
>> From the typographical point of view it is the worst of all possible methods. If you really wish it,
>
> The *really wish it* is the choice of the author, not the software.
>
>> then do not use TeX but M$ Word or OpenOffice. M$ Word automatically inserts nonbreakable spaces at some points in text written in Czech. As far as grammar is concerned, it is correct. However, U+00A0 is fixed width. If you look at the output, the nonbreakable spaces are too wide on some lines and too thin on other lines. I cannot imagine anything uglier.
>
> I do not disagree with you that this could be ugly. But that is not the point. If you want superior aesthetic typesetting, with nice choices for hyphenation, then don't use U+00A0. Of course!
>
> Whatever the reason for wanting to use this character, there should be a straightforward way to do it. Using the character itself:
>  a. is the most understandable
>  b. currently works
>  c. requires no special explanation.

These are reasons why people might wish it in the source files, not in the PDF. If you wish to take a [part of a] PDF and include it in another PDF as is, you can take the PDF directly without the need to grab the text. If you are interested in text that will be retypeset, you have to verify a lot of other things. If the text contained hyphenated words, you have to join the parts manually. You will have a lot of other work, and the time saved by U+00A0 will be negligible.

There are tools that may help you to insert nonbreakable spaces. I even have my own special tools written in Perl to handle one class of input files that are really plain texts, and the result is (almost) correctly marked-up LaTeX source.

> Cheers,
> Ross

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
Re: [XeTeX] Whitespace in input
Hi Phil,

On 16/11/2011, at 8:45 AM, Philip TAYLOR wrote:
> Ross Moore wrote:
>> On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
>>> Given that TeX (and XeTeX too) deal with a non-breakable space already (where we usually use the ~ to represent that space) it seems to me that XeTeX should treat that the same way.
>>
>> No, I disagree completely. What if you really want the U+00A0 character to be in the PDF? That is, when you copy/paste from the PDF, you want that character to come along for the ride.
>
> I'm not sure I entirely go along with this argument, Ross. What if you really want the \ character to be in the PDF, or the ^ character, or the $ character, or any character that TeX currently treats specially ?

TeX already provides \$ \_ \# etc. for (most of) the other special characters it uses, but does not for ^^a0 --- but it does not need to if you can generate it yourself on the keyboard.

> Whilst I can agree that there is considerable merit in extending XeTeX such that it treats all of these new, special characters specially (by creating new catcodes, new node types and so on), in the short term I can see no fundamental problem with treating U+00A0 in such a way that it behaves indistinguishably from the normal expansion of ~.

How do you explain to somebody the need to do something really, really special to get a character that they can type, or copy/paste? There is no special role for this character in other vital aspects of how TeX works, such as there is for $ _ # etc.

>> In TeX ~ *simulates* a non-breaking space visually, but there is no actual character inserted.
>
> And I don't agree that a space is a character, non-breaking or not !

In this view you are against most of the rest of the world.

If the output is intended to be PDF, as it really has to be with XeTeX, then the specifications for the modern variants of PDF need to be consulted. With PDF/A and PDF/UA and anything based on ISO-32000 (PDF 1.7) there is a requirement that the included content should explicitly provide word boundaries. Having a space character inserted is by far the most natural way to meet this specification. (This does not mean that having such a character in the output need affect TeX's view of typesetting.)

Before replying to anything in the above paragraph, please watch the video of my recent talk at TUG-2011.
http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/
or similar from earlier years where I also talk a bit about such things.

> ** Phil.

Hope this helps,
Ross

Ross Moore ross.mo...@mq.edu.au
Mathematics Department, Macquarie University, Sydney, Australia 2109
office: E7A-419  tel: +61 (0)2 9850 8955  fax: +61 (0)2 9850 8114
Re: [XeTeX] Whitespace in input
2011/11/15 Ross Moore ross.mo...@mq.edu.au:
> Hi Phil,
>
> On 16/11/2011, at 8:45 AM, Philip TAYLOR wrote:
>> Ross Moore wrote:
>>> On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
>>>> Given that TeX (and XeTeX too) deal with a non-breakable space already (where we usually use the ~ to represent that space) it seems to me that XeTeX should treat that the same way.
>>>
>>> No, I disagree completely. What if you really want the U+00A0 character to be in the PDF? That is, when you copy/paste from the PDF, you want that character to come along for the ride.
>>
>> I'm not sure I entirely go along with this argument, Ross. What if you really want the \ character to be in the PDF, or the ^ character, or the $ character, or any character that TeX currently treats specially ?
>
> TeX already provides \$ \_ \# etc. for (most of) the other special characters it uses, but does not for ^^a0 --- but it does not need to if you can generate it yourself on the keyboard.

00a0

>> Whilst I can agree that there is considerable merit in extending XeTeX such that it treats all of these new, special characters specially (by creating new catcodes, new node types and so on), in the short term I can see no fundamental problem with treating U+00A0 in such a way that it behaves indistinguishably from the normal expansion of ~.
>
> How do you explain to somebody the need to do something really, really special to get a character that they can type, or copy/paste? There is no special role for this character in other vital aspects of how TeX works, such as there is for $ _ # etc.
>
>>> In TeX ~ *simulates* a non-breaking space visually, but there is no actual character inserted.
>>
>> And I don't agree that a space is a character, non-breaking or not !
>
> In this view you are against most of the rest of the world.

TeX NEVER outputs a space as a glyph. Text extraction tools usually interpret horizontal spaces of sufficient size as U+0020. (The exception to the above-mentioned "never" is verbatim mode.)

> If the output is intended to be PDF, as it really has to be with XeTeX, then the specifications for the modern variants of PDF need to be consulted. With PDF/A and PDF/UA and anything based on ISO-32000 (PDF 1.7) there is a requirement that the included content should explicitly provide word boundaries. Having a space character inserted is by far the most natural way to meet this specification.

A space character is a fixed-width glyph. If you insist on it, you will never be able to typeset justified paragraphs; you will move back to the era of mechanical typewriters.

> (This does not mean that having such a character in the output need affect TeX's view of typesetting.)
>
> Before replying to anything in the above paragraph, please watch the video of my recent talk at TUG-2011.
> http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/
> or similar from earlier years where I also talk a bit about such things.
>
>> ** Phil.
>
> Hope this helps,
> Ross

-- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
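The claim that an interword space reaches the output as glue, not as a glyph, can be inspected directly. A minimal plain TeX sketch, assuming the default Computer Modern font of the plain format; the box contents are written to the log and terminal, where one would expect a \glue item between the two character items.

```tex
% Plain TeX. \scrollmode keeps \showbox from pausing for input.
\scrollmode
\showboxbreadth=100 \showboxdepth=100  % show the complete box contents
\tracingonline=1                       % also print them on the terminal
\setbox0=\hbox{a b}                    % two letters separated by one space
\showbox0                              % list the nodes inside the box
\bye
```

The glue item carries stretch and shrink components, which is what lets TeX justify a paragraph; a fixed-width space glyph would not have them.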
Re: [XeTeX] Whitespace in input
I was going to make the following point earlier--maybe in light of Phil's conclusion I should do it now.

There seems to be a tendency not to distinguish between a(n original) character in the sense of "character of a writing system", and a computer character. The former are visible symbols on a background medium. The latter are an entirely different set of symbols which to some extent parallel the former, and to some extent do not. Space, control codes, etc. don't exist in the former, but exist in the latter because it was a convenient way to encode certain functions one wished to apply to the other encoded characters--the ones that correspond more or less to original writing system characters.

These encoding sets have developed over time, and have consequently inherited all sorts of legacy issues, not all of which need supporting. Unicode provides tools. No one says one has to use them all.

Specifically, the purpose of XeTeX and other such engines is to allow for the nice typographical formatting of visual representations of script characters against some other defined background. From that point of view, so long as it does it, once it does it, it has achieved its goal. Transparency of all sorts of other things, such as providing input via PDF to other software, isn't and shouldn't be a *primary* goal.

That being said, no doubt it might be helpful to some to have this or that control character passed along. But that's not the essence of the exercise, and should only be done if it can be done cheaply, i.e. without a lot of risk to the primary objective. I guess the real question is that latter part.

K

On Tue, Nov 15, 2011 at 4:45 PM, in message 4ec2dd63.3040...@rhul.ac.uk, Philip TAYLOR p.tay...@rhul.ac.uk wrote:
> Ross Moore wrote:
>> On 16/11/2011, at 5:56 AM, Herbert Schulz wrote:
>>> Given that TeX (and XeTeX too) deal with a non-breakable space already (where we usually use the ~ to represent that space) it seems to me that XeTeX should treat that the same way.
>>
>> No, I disagree completely. What if you really want the U+00A0 character to be in the PDF? That is, when you copy/paste from the PDF, you want that character to come along for the ride.
>
> I'm not sure I entirely go along with this argument, Ross. What if you really want the \ character to be in the PDF, or the ^ character, or the $ character, or any character that TeX currently treats specially ? Whilst I can agree that there is considerable merit in extending XeTeX such that it treats all of these new, special characters specially (by creating new catcodes, new node types and so on), in the short term I can see no fundamental problem with treating U+00A0 in such a way that it behaves indistinguishably from the normal expansion of ~.
>
>> In TeX ~ *simulates* a non-breaking space visually, but there is no actual character inserted.
>
> And I don't agree that a space is a character, non-breaking or not !
>
> ** Phil.
Re: [XeTeX] Whitespace in input
Hi Phil, On 16/11/2011, at 10:08 AM, Zdenek Wagner wrote: How do you explain to somebody the need to do something really, really special to get a character that they can type, or copy/paste? There is no special role for this character in other vital aspects of how TeX works, such as there is for $ _ # etc. In TeX ~ *simulates* a non-breaking space visually, but there is no actual character inserted. And I don't agree that a space is a character, non-breaking or not ! In this view you are against most of the rest of the world. TeX NEVER outputs a space as a glyph. Text extraction tools usually interpret horizontal spaces of sufficient size as U+0020. I never said that it did, nor that it was necessary to do so. Those text extraction tools do a pretty reasonable job, but don't always get it right. Besides, there is reliance on a heuristic, which can be fallible, especially if there is content typeset in a very small font size. And what about at line-ends? They can get that wrong too. Such a reliance is rather against the TeX way of doing things, don't you think? Better is for TeX itself to apply the heuristic, since it knows the current font size and the separation between bits of words. (The exception to the above mentioned never is the verbatim mode.) That isn't good enough for TeX to produce PDF/A. Go and watch the videos that I pointed you to. Lower down I give a run-down of how a variant of TeX handles this problem, to very good effect. If the output is intended to be PDF, as it really has to be with XeTeX, then the specifications for the modern variants of PDF need to be consulted. With PDF/A and PDF/UA and anything based on ISO-32000 (PDF 1.7) there is a requirement that the included content should explicitly provide word boundaries. Having a space character inserted is by far the most natural way to meet this specification. A space character is a fixed-width glyph. 
If you insist on it, you will never be able to typeset justified paragraphs, you will move back to the era of mechanical typewriters. Absolutely wrong! I'm not insisting on it being included as the natural way to separate words within the PDF, though it certainly is a possible way that is used by other software. (This does not mean that having such a character in the output need affect TeX's view of typesetting.) Clearly you never even read this parenthetical statement ... Before replying to anything in the above paragraph, please watch the video of my recent talk at TUG-2011. ... and certainly you don't seem to have followed up on this piece of advice, to get a better perspective of what I'm talking about. http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/ or similar from earlier years where I also talk a bit about such things. Here is how you get *both* TeX-quality typesetting and explicit spaces as word-boundaries inside the PDF, with no loss of quality. What the experimental tagged-pdfTeX does is to use a font (called dummy-space) that contains just a single character at code U+0020, at a size that is almost zero -- it cannot be exactly zero, else PDF browsers may not select it for copy/paste, or other text-extraction. These extra spaces are inserted into the PDF content stream, *after* TeX has determined the correct positioning for high-quality typesetting. That is, it is *not* done by macros or widgets or suchlike, but is done internally by the pdfTeX engine at shipout time. The almost-zero size has no perceptible effect on the visual output. But the existence of these extra space characters means that all text-extraction methods work much more reliably. There *are* extra primitives that can be used to turn this off and on in places where such extra spaces are not wanted; e.g. in math. And there is a primitive to insert such a space, in case it is required manually, for whatever reason.
All of these primitives are used extensively when generating tagged PDF of mathematical expressions, and are thus available for other usage too. Hope this helps, Ross Ross Moore ross.mo...@mq.edu.au Mathematics Department office: E7A-419 Macquarie University tel: +61 (0)2 9850 8955 Sydney, Australia 2109 fax: +61 (0)2 9850 8114
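To make the contrast in this message concrete, here is a minimal Python sketch of the gap heuristic that text-extraction tools are described as using, and of why an explicit space in the content stream removes the guesswork. Everything in it is an assumption made for illustration: the Glyph record, the extract_text function and the gap_ratio threshold are hypothetical names, and this is not the algorithm of any particular PDF viewer nor the tagged-pdfTeX implementation.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # Unicode text the glyph maps to
    x: float      # left edge, in points
    width: float  # advance width, in points
    size: float   # font size, in points


def extract_text(glyphs, gap_ratio=0.2):
    """Rebuild a text line from positioned glyphs.

    A word space is guessed whenever the horizontal gap between two
    glyphs exceeds gap_ratio times the font size (a hypothetical
    threshold). If either neighbour is an explicit space, nothing
    is guessed.
    """
    out = []
    prev = None
    for g in glyphs:
        if prev is not None and " " not in (prev.char, g.char):
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * min(prev.size, g.size):
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# A loosely set line: the 3pt gap is recognised as a space.
loose = [Glyph("a", 0, 5, 10), Glyph("b", 8, 5, 10)]
# A tightly set line: the 1.5pt gap falls under the threshold.
tight = [Glyph("a", 0, 5, 10), Glyph("b", 6.5, 5, 10)]
# The same tight line with a near-zero-size explicit space glyph.
explicit = [Glyph("a", 0, 5, 10), Glyph(" ", 5, 0.01, 0.01), Glyph("b", 6.5, 5, 10)]

print(repr(extract_text(loose)))
print(repr(extract_text(tight)))
print(repr(extract_text(explicit)))
```

In the tight line the heuristic loses the word boundary, while the explicit space survives regardless of how much the interword glue was shrunk.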
Re: [XeTeX] Whitespace in input
2011/11/16 Ross Moore ross.mo...@mq.edu.au: On 16/11/2011, at 9:45 AM, Zdenek Wagner wrote: 2011/11/15 Ross Moore ross.mo...@mq.edu.au: What if you really want the U+00A0 character to be in the PDF? That is, when you copy/paste from the PDF, you want that character to come along for the ride. From the typographical point of view it is the worst of all possible methods. If you really wish it, Maybe you misunderstood what I meant here. I'm not saying that you might want U+00A0 for *every* place where there is a word-breaking space. Just that there may be individual instance(s) where you have a reason to want it. Just like any other Unicode character, if you want it then you should be able to put it in there. You ARE able to do it. Choose a font with that glyph, set \catcode to 11 or 12 and that's it. What else do you wish to do? That's what XeTeX currently does (with the TeX-wise familiar ASCII exceptions) for any code-point supported by the chosen font. The *really wish it* is the choice of the author, not the software. then do not use TeX but M$ Word or OpenOffice. M$ Word automatically inserts nonbreakable spaces at some points in the text written in Czech. As far as grammar is concerned, it is correct. However, U+00a0 is fixed width. If you look at the output, the nonbreakable spaces are too wide on some lines and too thin on other lines. I cannot imagine anything uglier. I do not disagree with you that this could be ugly. But that is not the point. If you want superior aesthetic typesetting, with nice choices for hyphenation, then don't use U+00A0. Of course! Whatever the reason for wanting to use this character, there should be a straightforward way to do it. Using the character itself is: a. the most understandable b. currently works c. requires no special explanation. These are reasons why people might wish it in the source files, not in PDF. Yes.
In the source, to have the occasional such character included within the PDF, for whatever reason appropriate to the material being typeset -- whether verbatim, or not. If you wish to take a [part of] PDF and include it in another PDF as is, you can take the PDF directly without the need of grabbing the text. If you are interested in the text that will be retypeset, you have to verify a lot of other things. How is any of this relevant to the current discussion? It was you who came up with the argument that you wish to have nonbreakable spaces when copying the text from PDF. If the text contained hyphenated words, you have to join the parts manually. You will have a lot of other work and the time saved by U+00a0 will be negligible. There are tools that may help you to insert nonbreakable spaces. I even have my own special tools written in Perl to handle one class of input files that are really plain texts and the result is (almost) correctly marked LaTeX source. All well and good. But how is that relevant to anything I said? See above. -- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
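As an illustration of the kind of tool mentioned in this message, the following Python sketch inserts the TeX tie ~ after single-letter Czech prepositions and conjunctions, so that they cannot be left at the end of a line. It is not Zdeněk's Perl tool; the function name tie_prepositions and the exact letter set are assumptions chosen for the example, and a real tool would need to skip verbatim material, math and control sequences.

```python
import re

# Single-letter Czech prepositions/conjunctions (hypothetical rule set).
# \b ensures the letter is a word of its own, not the end of a longer word.
_SINGLE_LETTER = re.compile(r"\b([KkSsVvZzOoUuAaIi]) +(?=\w)")


def tie_prepositions(text: str) -> str:
    """Replace the space after a single-letter word with a TeX tie (~)."""
    return _SINGLE_LETTER.sub(r"\1~", text)


print(tie_prepositions("Byl v lese s kamarádem."))
```

Because the result uses ~ rather than U+00A0, the tie stays visible in any editor and its width remains stretchable glue when TeX justifies the paragraph.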
Re: [XeTeX] Whitespace in input
Ross Moore wrote: Hi Phil, On 16/11/2011, at 10:08 AM, Zdenek Wagner wrote: Not I, Sir : Zdeněk ! ** Phil.
Re: [XeTeX] Whitespace in input
Hi Zdenek, On 16/11/2011, at 11:19 AM, Zdenek Wagner wrote: Just like any other Unicode character, if you want it then you should be able to put it in there. You ARE able to do it. Choose a font with that glyph, set \catcode to 11 or 12 and that's it. What else do you wish to do? The *default* behaviour should stay as this. Any other behaviour needs to change the catcode and make perhaps a definition. These are reasons why people might wish it in the source files, not in PDF. Yes. In the source, to have the occasional such character included within the PDF, for whatever reason appropriate to the material being typeset -- whether verbatim, or not. If you wish to take a [part of] PDF and include it in another PDF as is, you can take the PDF directly without the need of grabbing the text. If you are interested in the text that will be retypeset, you have to verify a lot of other things. How is any of this relevant to the current discussion? It was you who came with the argument that you wish to have nonbreakable spaces when copying the text from PDF. No. I said that if you put one in, then you should be expecting to get one out. This should be the default behaviour, as it is now. I certainly suggested nothing like getting out non-breaking spaces as a replacement for anything else. Hope this helps, Ross Ross Moore ross.mo...@mq.edu.au Mathematics Department office: E7A-419 Macquarie University tel: +61 (0)2 9850 8955 Sydney, Australia 2109 fax: +61 (0)2 9850 8114
Re: [XeTeX] Whitespace in input
msk...@ansuz.sooke.bc.ca wrote: various points with which I have no reason to disagree at this time, followed by 2. Inevitably, people will include invalid characters in TeX input; and U+00A0 is an invalid character for TeX input. Firstly (as is clear from the list on which we are discussing this), we are not discussing TeX but XeTeX. Secondly, even if we were discussing TeX, on what basis do you claim that U+00A0 is invalid ? And if you assert that it is, /a priori/, invalid for TeX, and if your reasons for that assertion are sound, do they also support the assertion that it is, /a priori/, invalid for XeTeX ? Remainder snipped, so that we can debate one point at a time. Philip Taylor
Re: [XeTeX] Whitespace in input
2011/11/14 Philip TAYLOR p.tay...@rhul.ac.uk: msk...@ansuz.sooke.bc.ca wrote: various points with which I have no reason to disagree at this time, followed by 2. Inevitably, people will include invalid characters in TeX input; and U+00A0 is an invalid character for TeX input. Firstly (as is clear from the list on which we are discussing this), we are not discussing TeX but XeTeX. Secondly, even if we were discussing TeX, on what basis do you claim that U+00A0 is invalid ? And if you assert that it is, /a priori/, invalid for TeX, and if your reasons for that assertion are sound, do they also support the assertion that it is, /a priori/, invalid for XeTeX ? Remainder snipped, so that we can debate one point at a time. I agree with Phil: there is nothing in TeX that makes a character invalid a priori. It is made invalid by \catcode. There are two aspects: A. We are preparing a document to be typeset by TeX. Why on earth should we use only U+00a0 and not ~, which is clearly visible in any editor and has been used for a nonbreakable space for years? Why do we use & in \halign or \begin{tabular} and not U+0009? B. TeX is used to typeset data extracted from a database (or similar source) that was not TeX-aware in the first place. Such data can contain not only U+00a0 but even texts such as Tweedledum & Tweedledee, 12 $, 15 %, #1, whatever. In such a case we must be aware that the input may contain arbitrary characters, even those playing special roles in TeX. We have to handle them properly. -- Zdeněk Wagner http://hroch486.icpf.cas.cz/wagner/ http://icebearsoft.euweb.cz
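For case B, one common approach is to escape the data before it ever reaches TeX. The Python sketch below shows the idea under stated assumptions: the function name escape_for_latex and the mapping table are hypothetical, the replacements are the usual LaTeX text-mode commands (plain TeX would need different ones), and turning U+00A0 into ~ is just one possible policy rather than something the thread agreed on.

```python
# Characters with a special role in (La)TeX and a text-mode replacement.
# The table is an illustrative assumption, not a complete treatment.
TEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "&": r"\&",
    "#": r"\#",
    "%": r"\%",
    "_": r"\_",
    "^": r"\textasciicircum{}",
    "~": r"\textasciitilde{}",
    "\u00a0": "~",  # no-break space becomes the familiar TeX tie
}


def escape_for_latex(text: str) -> str:
    """Escape one database field, character by character.

    Working per character avoids escaping the backslashes that the
    replacements themselves introduce.
    """
    return "".join(TEX_ESCAPES.get(ch, ch) for ch in text)


for field in ["Tweedledum & Tweedledee", "12 $, 15 %, #1", "Dr.\u00a0Knuth"]:
    print(escape_for_latex(field))
```

The alternative is to change catcodes inside TeX while the data is being read, which keeps the data untouched but requires more care with macros that have already been tokenized.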
Re: [XeTeX] Whitespace in input
On Mon, 14 Nov 2011, Philip TAYLOR wrote: 2. Inevitably, people will include invalid characters in TeX input; and U+00A0 is an invalid character for TeX input. Firstly (as is clear from the list on which we are discussing this), we are not discussing TeX but XeTeX. Secondly, even XeTeX is a TeX engine. Obviously, it is free to define its own input format, and that format already differs from other TeX engines by (for instance) allowing some Unicode code points outside the 7-bit range. But I still see XeTeX as a version of TeX, not something completely different, and it's appropriate for expectations we might have about TeX - for instance, the expectation that formatting commands are visible and the non-breaking space formatting command is ~ - to also apply to XeTeX where they are appropriate. if we were discussing TeX, on what basis do you claim that U+00A0 is invalid ? And if you assert that it is, /a priori/, It's invalid if XeTeX says it is invalid, and I think XeTeX should say it is invalid. -- Matthew Skala msk...@ansuz.sooke.bc.ca People before principles. http://ansuz.sooke.bc.ca/
Re: [XeTeX] Whitespace in input
msk...@ansuz.sooke.bc.ca wrote: XeTeX is a TeX engine. Obviously, it is free to define its own input format, and that format already differs from other TeX engines by (for instance) allowing some Unicode code points outside the 7-bit range. I think (with respect) that "some Unicode code points outside the 7-bit range" is a gross understatement. As far as I am aware, XeTeX permits a very considerable subset of Unicode (perhaps even all of it; I do not know) as input. if we were discussing TeX, on what basis do you claim that U+00A0 is invalid ? And if you assert that it is, /a priori/, It's invalid if XeTeX says it is invalid, and I think XeTeX should say it is invalid. That is a very different statement, and as that is your personal position, I respect it as such. Of course, I disagree :-) ** Phil.
Re: [XeTeX] Whitespace in input
On Mon, 14 Nov 2011, Philip TAYLOR wrote: I think (with respect) that "some Unicode code points outside the 7-bit range" is a gross understatement. As far as I am aware, XeTeX permits a very considerable subset of Unicode (perhaps even all of it; I do not know) as input. My point is that it shouldn't treat U+00A0 as equivalent to U+007E, or as valid at all, just because it supports Unicode. That is not what supporting Unicode means. -- Matthew Skala msk...@ansuz.sooke.bc.ca People before principles. http://ansuz.sooke.bc.ca/
Re: [XeTeX] Whitespace in input
On 14.11.2011 at 18:30, msk...@ansuz.sooke.bc.ca wrote: 1. No. That is not what Unicode is for. Unicode's goal is to subsume all reasonable pre-existing encodings. Unicode is even more. Look at all the Annexes to Unicode 6.0. Some reasonable pre-existing encodings include a non-breaking space character, so Unicode includes one. That does not mean Unicode says you should actually use it! There are many precedents of Unicode providing multiple ways of representing things, as a result of including characters from other systems, without it being reasonable to demand that all Unicode-compatible systems must support all of them. For instance, most of the U+FFxx range is devoted to different kinds of hacks for handling partial-width characters in Asian-language typesetting; the preferred way to do that nowadays is via OpenType features, but the code points remain in the standard. The U+0000 to U+001F range is basically control characters for Teletype machines; some of those, like U+000A and U+000D, are widely used in modern documents (but in varying ways by different systems!) and others, like U+001D, are virtually unheard-of. Unicode does NOT say everybody has to support them all, let alone all in the same way. Hmm, I have difficulties exactly understanding the conformance chapter of Unicode 6.0 ( http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf ), but it seems to me that claiming Unicode support is a very strong statement. The U+00A0 code point is not explicitly deprecated in Unicode, but it was never a principle of Unicode that all implementations have to support all defined control characters regardless of appropriateness to the particular purpose. Non-breaking space is, from TeX's point of view, not really a character at all, but a formatting command; and TeX already has a way of dealing with formatting commands in general and this one in particular.
It is appropriate to say that the preferred way of handling non-breaking spaces in TeX input is the existing TeX way; and saying that in NO WAY AT ALL contradicts anything in Unicode. Unicode is servant, not master. I think it's more like math being servant _and_ master of natural sciences. 2. Inevitably, people will include invalid characters in TeX input; and U+00A0 is an invalid character for TeX input. The best way to deal with it is to treat it like any other invalid character and generate an error message. A reasonable alternative would be to say "it is whitespace; it will be treated like other whitespace". That would mean ignoring its breaking/non-breaking-ness, as we have for a long time similarly ignored the special properties of U+0009 (tab). Of course, if users want to define a special meaning for U+00A0 in their own input, they can do so with the existing mechanisms for redefining the meanings of input characters; but "U+00A0 is equivalent to U+007E (~)", for instance, should never be the default and (because of trouble displaying it) shouldn't be encouraged. Now we come to the trouble of Unicode specifying a line-breaking algorithm ( http://www.unicode.org/reports/tr14/tr14-26.html ), which probably isn't exactly TeX's. I'm not into these algorithms, so I can't compare. But I would ask some Master of this Art to speak up about this conflict. 3. No. Better to keep everything visible and backward compatible. U+007E (~) should remain the preferred way of doing non-breaking space. Should and is … (see other posts). 4. Not applicable because of the answer to #3. Users who do insist on putting U+00A0 in their input presumably have *already* got their own reasons to think that it's more convenient for them, including solutions satisfactory to themselves for how to type it on keyboards and see it on screens, so that's their business and not a problem we need to solve. I'm personally trying hard to find a correct way.
As of now, I have found a very simple solution to input special whitespace characters. (Using Linux, doing this is easy business with ibus.) Alas, I haven't found any editor suited better to my TeX needs than Kile, but I haven't yet managed to highlight these special whitespace characters properly. => Some experts can do all these things. That doesn't mean everyone else should stick to stupid old ASCII-7. bye Toscho
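The three treatments of U+00A0 discussed in point 2 (reject it, treat it as ordinary whitespace, or map it to ~) can be compared side by side with a small preprocessing sketch. This is purely illustrative Python: the function preprocess and its policy names are hypothetical, and XeTeX itself would implement any such choice through \catcode settings and macro definitions, not through an external filter.

```python
NBSP = "\u00a0"


def preprocess(source: str, policy: str = "error") -> str:
    """Apply one of three hypothetical policies to U+00A0 in TeX input."""
    if policy == "error":
        # Treat it like any other invalid character and report where it is.
        pos = source.find(NBSP)
        if pos != -1:
            raise ValueError(f"U+00A0 found at offset {pos}")
        return source
    if policy == "space":
        # Treat it as ordinary whitespace, dropping its non-breaking property.
        return source.replace(NBSP, " ")
    if policy == "tie":
        # Treat it as equivalent to the TeX tie.
        return source.replace(NBSP, "~")
    raise ValueError(f"unknown policy: {policy!r}")


sample = "Dr.\u00a0Knuth"
print(preprocess(sample, "space"))
print(preprocess(sample, "tie"))
```

Only the "tie" policy preserves the author's intent, and only the "error" policy makes an invisible character visible; that trade-off is what the thread is arguing about.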
Re: [XeTeX] Whitespace in input
msk...@ansuz.sooke.bc.ca wrote: On Mon, 14 Nov 2011, Philip TAYLOR wrote: I think (with respect) that "some Unicode code points outside the 7-bit range" is a gross understatement. As far as I am aware, XeTeX permits a very considerable subset of Unicode (perhaps even all of it; I do not know) as input. My point is that it shouldn't treat U+00A0 as equivalent to U+007E, or as valid at all, just because it supports Unicode. That is not what supporting Unicode means. I agree with your opinion that it should not treat U+00A0 as equivalent to U+007E -- indeed, the Unicode standard specifies as its canonical decomposition : <noBreak> SPACE (U+0020). However, I cannot agree that it should not be treated as valid; that is just the thin end of the wedge, and I would sooner there were no wedge at all. XeTeX's primary strength is that it supports Unicode; we should not weaken that strength by requiring that it supports some parts of Unicode and not others. My EUR 0,02. ** Phil.
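The decomposition Phil refers to can be checked against the Unicode Character Database with Python's standard unicodedata module. The short sketch below is only a way of inspecting the data, not part of any TeX workflow. One detail worth noting: the <noBreak> tag marks a compatibility mapping rather than a canonical one, which is why NFC leaves U+00A0 alone while NFKC folds it to U+0020.

```python
import unicodedata

nbsp = "\u00a0"
tilde = "\u007e"

name = unicodedata.name(nbsp)
decomp = unicodedata.decomposition(nbsp)
tilde_decomp = unicodedata.decomposition(tilde)

print(name)                  # NO-BREAK SPACE
print(decomp)                # <noBreak> 0020
print(repr(tilde_decomp))    # '' : the tilde has no decomposition at all

# Canonical normalization keeps the character; compatibility
# normalization replaces it with an ordinary space.
nfc = unicodedata.normalize("NFC", nbsp)
nfkc = unicodedata.normalize("NFKC", nbsp)
print(nfc == nbsp, nfkc == " ")
```

Nothing in the database relates U+00A0 to U+007E; the link between the two exists only in TeX convention.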
Re: [XeTeX] Whitespace in input
On Mon, Nov 14, 2011 at 12:15 PM, in message 4ec14cb5.7000...@rhul.ac.uk, Philip TAYLOR p.tay...@rhul.ac.uk wrote: XeTeX is a TeX engine. Obviously, it is free to define its own input format, and that format already differs from other TeX engines by (for instance) allowing some Unicode code points outside the 7-bit range. I think (with respect) that "some Unicode code points outside the 7-bit range" is a gross understatement. As far as I am aware, XeTeX permits a very considerable subset of Unicode (perhaps even all of it; I do not know) as input. I use U+12000 and above regularly, as a case in point... K
Re: [XeTeX] Whitespace in input
On Mon, 14 Nov 2011, Karljurgen Feuerherm wrote: I use U+12000 and above regularly, as a case in point... Do you think that basic formatting control functions should be bound to code points in that range, as the preferred way of accessing those functions? Let's not lose track of what this discussion is about. XeTeX can *with appropriate font support* accept nearly any Unicode point in its input. But very few Unicode points are treated specially by XeTeX as such, and I don't think U+00A0 should be one of them. -- Matthew Skala msk...@ansuz.sooke.bc.ca People before principles. http://ansuz.sooke.bc.ca/
Re: [XeTeX] Whitespace in input
I didn't say anything about U+00A0 one way or the other... Keeping in mind that the purpose of this software is to get work done, and not to fulfil anyone's philosophical notions of software, my general feeling is that: * Xe(La)TeX should support plain text characters--for *my* present purpose, meaning characters which are printable, pure and simple, regardless of where in the Unicode space they are; as far as I know, this is the case now (and my case in point was more or less just aimed at this issue); * it should support whatever other characters are necessary to complex rendering, if it doesn't already; * optionally it can/could support whatever else, as the in-the-flesh maintainers of the package have time and leisure to implement. I said 'feel', because it seems to me all very well for the rest of us to debate philosophy back and forth, but unless we're doing the actual work... As someone has already pointed out, lots of what is in Unicode is there because it is UNI-code. It may very well have outlived its usefulness, at least in the context of Xe(La)TeX doing the work one would like it to do. Just because something is in Unicode doesn't mean one has to want to use it. In fact, the more unnecessary things one implements, the better the chance of instability. There are no doubt multiple ways to achieve this pragmatically stated goal. I don't feel any vested interest in dictating to anyone the preference for how to go about it. K On Mon, Nov 14, 2011 at 2:15 PM, in message alpine.lnx.2.00.141312201.3...@tetsu.ansuz.sooke.bc.ca, msk...@ansuz.sooke.bc.ca wrote: On Mon, 14 Nov 2011, Karljurgen Feuerherm wrote: I use U+12000 and above regularly, as a case in point... Do you think that basic formatting control functions should be bound to code points in that range, as the preferred way of accessing those functions? Let's not lose track of what this discussion is about. XeTeX can *with appropriate font support* accept nearly any Unicode point in its input.
But very few Unicode points are treated specially by XeTeX as such, and I don't think U+00A0 should be one of them. -- Matthew Skala msk...@ansuz.sooke.bc.ca People before principles. http://ansuz.sooke.bc.ca/