On Fri, 2006-10-27 at 14:29 +0200, Peter Kümmel wrote:

> Unicode is Unicode. I assume XeTeX does not support all Unicode
> symbols but only those which can be represented by ONE 16-bit
> UTF-16 value (like the Mac?). I think this means "XeTeX uses UTF-16
> encoding internally".

The article about XeTeX in TUGboat (#2-2005) says:

<quote>
As Unicode scalar values may be up to U+10FFFF, an obvious modification
would be to make “characters” 32 bits wide, and treat Unicode characters
as the basic units of text. 

However, in XeTeX a pragmatic decision was made to work internally with
UTF-16 as the encoding form, making “characters” in the engine 16 bits
wide, and handling supplementary-plane characters using UTF-16 surrogate
pairs. This choice was made for a number of reasons:

• The operating-system APIs that XeTeX uses in working with Unicode text
require UTF-16, so working with this encoding form avoids the need for
conversion.

• A number of internal arrays in TeX are indexed by character codes.
Enlarging these from 256 elements each to 65,536 elements seems
reasonable; enlarging them to a million-plus elements each would
dramatically increase the memory footprint of the system.
To avoid this, a sparse array implementation might be used, but this
would be significantly more complex to develop and test, and might well
have a negative impact on typesetting performance.

• It seems unlikely, in any case, that there will be much need to
customize these properties (see next section) for characters beyond
Plane 0.

In view of these factors, XeTeX works with UTF-16 code units. Unicode
characters beyond U+FFFF can still be included in documents, however,
and will render correctly (given appropriate fonts) as the UTF-16
surrogate pairs will be passed to the font system.

</quote>
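
To make the surrogate-pair point concrete, here is a small Python sketch
of my own (not code from XeTeX or from the article). It shows how one
character beyond U+FFFF becomes two 16-bit code units, and how the table
sizes mentioned in the quote compare. The function and constant names
are hypothetical.

```python
def to_utf16_units(code_point: int) -> list:
    """Return the UTF-16 code units (as ints) for one Unicode scalar value."""
    # Surrogate code points themselves are not valid scalar values.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        # Plane 0 characters fit in a single 16-bit unit.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]


# Entries needed in a table indexed by character code:
BMP_TABLE_SIZE = 0xFFFF + 1      # 65,536 entries for 16-bit code units
FULL_TABLE_SIZE = 0x10FFFF + 1   # 1,114,112 entries for all of Unicode

if __name__ == "__main__":
    units = to_utf16_units(0x1D11E)  # U+1D11E MUSICAL SYMBOL G CLEF
    print([hex(u) for u in units])
    print(FULL_TABLE_SIZE // BMP_TABLE_SIZE)  # growth factor of such a table
```

So a 16-bit engine can still carry such a character through as two
units; what it cannot do is give that character its own entry in the
per-character tables.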


> The files are stored in UTF-8 by LyX and should be readable by XeTeX,
> without any Unicode conversion programs (I could not imagine that
> XeTeX couldn't read UTF-8 files).

Right.

<quote>
While XeTeX is designed to work with Unicode throughout the typesetting
process, users may well wish to typeset text that is in a different
encoding. By default, XeTeX interprets input text as being UTF-8,
converting multi-byte sequences to Unicode character codes
appropriately, unless inspection of the file suggests that the text is
UTF-16 (identified by a Byte Order Mark code, or by null high-order
bytes in the initial 16-bit code units). Either way, the input is
assumed to be valid Unicode.
</quote>
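
The detection rule described there could look roughly like the Python
sketch below. This is only my reading of the paragraph, not XeTeX's
actual code: the function name is hypothetical, and looking at exactly
the first two 16-bit code units is my own assumption, since the article
does not say how many are inspected.

```python
import codecs


def guess_input_encoding(data: bytes) -> str:
    """Guess UTF-8 vs UTF-16 from the first bytes of an input file."""
    # 1. A Byte Order Mark settles the question directly.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # 2. Otherwise look for null high-order bytes in the initial code
    #    units (assumption: the first two units, i.e. four bytes).
    head = data[:4]
    if len(head) == 4:
        if head[0] == 0 and head[2] == 0 and head[1] != 0 and head[3] != 0:
            return "utf-16-be"  # 00 xx 00 xx
        if head[1] == 0 and head[3] == 0 and head[0] != 0 and head[2] != 0:
            return "utf-16-le"  # xx 00 xx 00
    # 3. Default: treat the input as UTF-8.
    return "utf-8"


if __name__ == "__main__":
    print(guess_input_encoding("\\documentclass{article}".encode("utf-8")))
    print(guess_input_encoding("\\documentclass{article}".encode("utf-16-le")))
```

A UTF-8 file written by LyX has no BOM and no null bytes at the start,
so it falls through to the default.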

So, by getting LyX to work with Unicode in UTF-8, we will get XeTeX
support as well :-)

However, no fun for LyX-1.4.x

Sincerely,
Gour



