On Sat, Oct 01, 2005 at 01:02:31AM +0200, Lennart Borgman wrote:
> Piet van Oostrum wrote:

[...]

> >That is just the internal representation of the character in Emacs. It's
> >not important. What matters is what Emacs writes to your file. When you
> >write out utf-8 (for example by giving the command

[...]

> So you mean that at a - what should I call it? - "text semantic level"
> the utf-8 char and the latin-1 char has the same meaning?

Yes. You put that nicely. The *character* (a dieresis) stays the same.
The *representation* (loosely referred to as `encoding') changes.

I said loosely, because with more complex things such as utf-8 there are
actually two layers: the `character set', mapping each character to an
integer (aka `code point'), which in this case would be Unicode or
ISO 10646 (nowadays equivalent), and the representation in a file, which
may be utf-8 (most common), utf-16 or whatnot.

Now the advantage of utf-8: it is a variable-width encoding, and uses up
just one byte for one ASCII character (for ASCII the code points, and
hence the bytes, are the same). So you can interpret an ASCII file
``as-is'' as a utf-8 file. For higher characters (for example the ones
with codes >127 in iso-8859-1, aka Latin1), you need more than one byte
in utf-8. AFAIK the original design allowed up to 6 bytes; RFC 3629 has
since limited it to 4.

The disadvantage is: it is a variable-width encoding, so you have to
process a text sequentially, byte-for-byte, to get the character
boundaries right (it's designed to re-synchronize gracefully, though).

If you want the whole story (on Unicode, ISO 10646, UTF-8), see here:

  <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

(highly recommended). From the perspective of a web slave, see:

  <http://www.w3.org/TR/REC-html40/charset.html>

HTH
-- tomas
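
P.S. A minimal sketch of the above, in Python (my choice of language for
illustration; nothing in this thread depends on it). I'm assuming the
character in question is a-dieresis (U+00E4); the helper names
is_continuation and char_starts are made up for this example. It shows one
code point with two byte representations, ASCII passing through utf-8
unchanged, the variable width, and how character boundaries can be found
again from the bytes alone.

```python
# Assumption: the character under discussion is U+00E4 (a with dieresis).
ch = "\u00e4"

# Layer 1: the character set maps the character to an integer (code point).
code_point = ord(ch)

# Layer 2: the representation in a file differs per encoding.
latin1_bytes = ch.encode("latin-1")   # a single byte
utf8_bytes = ch.encode("utf-8")       # two bytes

# Decoding either representation yields the same character again.
same_character = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")

# ASCII text is byte-for-byte identical in ASCII and utf-8.
ascii_unchanged = "Emacs".encode("ascii") == "Emacs".encode("utf-8")

# Variable width: number of utf-8 bytes for a few sample characters
# (ASCII letter, a-dieresis, euro sign, a character beyond the BMP).
samples = ["a", "\u00e4", "\u20ac", "\U0001F600"]
widths = [len(s.encode("utf-8")) for s in samples]


def is_continuation(byte):
    """True if the byte has the form 10xxxxxx (a utf-8 continuation byte)."""
    return (byte & 0xC0) == 0x80


def char_starts(data):
    """Byte offsets at which a character starts in well-formed utf-8 data."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]


# Re-synchronization: any byte that is not a continuation byte begins a
# character, so a reader dropped mid-stream can skip to the next start.
starts = char_starts("a\u00e4\u20ac".encode("utf-8"))
```

The boundary scan only looks at the top two bits of each byte, which is
what makes re-synchronizing cheap; it does not validate the data.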