On Fri, 2006-10-27 at 14:29 +0200, Peter Kümmel wrote:
> Unicode is Unicode. I assume XeTex does not support all Unicode
> symbols but only these which could be represented by ONE 16 bit
> UTF-16 value (like the Mac?). I think this means "XeTeX uses utf-16
> encoding internally".

In TUGboat, #2-2005, in the article about XeTeX it is written:

<quote>
As Unicode scalar values may be up to U+10FFFF, an obvious modification
would be to make “characters” 32 bits wide, and treat Unicode characters
as the basic units of text. However, in XeTeX a pragmatic decision was
made to work internally with UTF-16 as the encoding form, making
“characters” in the engine 16 bits wide, and handling
supplementary-plane characters using UTF-16 surrogate pairs. This choice
was made for a number of reasons:

• The operating-system APIs that XeTeX uses in working with Unicode text
  require UTF-16, so working with this encoding form avoids the need for
  conversion.
• A number of internal arrays in TeX are indexed by character codes.
  Enlarging these from 256 elements each to 65,536 elements seems
  reasonable; enlarging them to a million-plus elements each would
  dramatically increase the memory footprint of the system. To avoid
  this, a sparse array implementation might be used, but this would be
  significantly more complex to develop and test, and might well have a
  negative impact on typesetting performance.
• It seems unlikely, in any case, that there will be much need to
  customize these properties (see next section) for characters beyond
  Plane 0.

In view of these factors, XeTeX works with UTF-16 code units. Unicode
characters beyond U+FFFF can still be included in documents, however,
and will render correctly (given appropriate fonts) as the UTF-16
surrogate pairs will be passed to the font system.
</quote>

> The files are stored in UTF-8 by lyx and should be readable by XeTex,
> without any unicode conversion programs (I could not imagine that
> XeTex couldn't read UTF-8 files).

Right.

<quote>
While XeTeX is designed to work with Unicode throughout the typesetting
process, users may well wish to typeset text that is in a different
encoding.
By default, XeTeX interprets input text as being UTF-8, converting
multi-byte sequences to Unicode character codes appropriately, unless
inspection of the file suggests that the text is UTF-16 (identified by a
Byte Order Mark code, or by null high-order bytes in the initial 16-bit
code units). Either way, the input is assumed to be valid Unicode.
</quote>

So, by getting LyX to work with Unicode in UTF-8, we will get XeTeX
support as well :-)

However, no fun for LyX-1.4.x

Sincerely,
Gour
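
P.S. To make the two mechanisms from the article a bit more concrete,
here is a rough Python sketch. It is only an illustration, not XeTeX's
actual code: the function names are made up, and the UTF-16 detection is
my guess at what "null high-order bytes in the initial 16-bit code
units" could mean in practice (I assume only the first few code units
are inspected). The surrogate-pair arithmetic is the standard UTF-16
rule.

```python
def to_utf16_code_units(ch):
    """Return the 16-bit UTF-16 code units for one Unicode character."""
    cp = ord(ch)
    if cp < 0x10000:
        # Plane 0 character: fits in a single 16-bit code unit.
        return [cp]
    # Supplementary-plane character: split the 20-bit offset into a
    # high (lead) and a low (trail) surrogate.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


def guess_input_encoding(data):
    """Guess 'utf-8', 'utf-16-le' or 'utf-16-be' for raw input bytes.

    Hypothetical heuristic: a Byte Order Mark wins; otherwise look for
    null high-order bytes in the first few 16-bit units; otherwise
    fall back to UTF-8.
    """
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"

    head = data[:8]  # assumption: only the first four units are checked
    pairs = [head[i:i + 2] for i in range(0, len(head) - 1, 2)]
    if pairs and all(p[0] == 0 and p[1] != 0 for p in pairs):
        return "utf-16-be"  # high-order byte comes first and is null
    if pairs and all(p[1] == 0 and p[0] != 0 for p in pairs):
        return "utf-16-le"  # high-order byte comes second and is null
    return "utf-8"


if __name__ == "__main__":
    # U+1D11E (MUSICAL SYMBOL G CLEF) lies beyond Plane 0.
    print([hex(u) for u in to_utf16_code_units("\U0001D11E")])
    print(guess_input_encoding("LyX".encode("utf-8")))
    print(guess_input_encoding("LyX".encode("utf-16-le")))
```

So a character beyond U+FFFF simply becomes two 16-bit "characters"
inside the engine, and a plain UTF-8 file written by LyX would take the
default path in the second function.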
