> PPS. AFAIK UTF-8 is not used internally in any OS - it's only used for storing
> UNICODE text in more compact form - web site authors really like it.
I believe a lot of Linux distros are switching to it for the console, at least for less common languages. I don't know how GUI code on Linux handles text. The Windows routines for converting between UTF-16 and local codesets can also convert between UTF-16 and UTF-7 or UTF-8, but I don't think Windows itself makes any real use of those encodings.

UTF-8 is smaller than UTF-16 in some cases, larger in others, and about the same in still others; it largely depends on which code points dominate the text. An appropriate national encoding will usually beat both of them if it can represent the needed code points.

Bytes per code point:

mainly $000000-$00007F   utf-8: 1 byte    utf-16: 2 bytes   utf-32: 4 bytes
mainly $000080-$0007FF   utf-8: 2 bytes   utf-16: 2 bytes   utf-32: 4 bytes
mainly $000800-$00FFFF   utf-8: 3 bytes   utf-16: 2 bytes   utf-32: 4 bytes
mainly $010000-$10FFFF   utf-8: 4 bytes   utf-16: 4 bytes   utf-32: 4 bytes

The net result is that UTF-8 tends to win for largely Latin languages, UTF-16 tends to win for largely ideographic languages, and they are about on a par for everything else. UTF-32 nearly always loses to both (though it does have a large spare codespace which can be used for special meanings internal to the app).

The main advantages of UTF-8 over UTF-16 are:

1: it is a superset of 7-bit ASCII.
2: it is not peppered with 0 bytes.
3: any character can ONLY be represented by one byte pattern, and that byte pattern can ONLY represent that character (it can't be a part of another character).
4: it is easy to resync a badly cut/joined stream (if you cut a UTF-16 stream in the middle of a code unit, one of the pieces will be total garbage).

The net result is that most code designed to deal with "ASCII with extensions" can be fed UTF-8 and it will usually work fine or only require minimal changes.
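To make the byte counts and the resync point concrete, here is a minimal sketch in Python (chosen only for brevity; it has nothing to do with how FPC does it). The helper names encoded_sizes and resync_utf8 are made up for this example, and the resync helper only repairs the start of a fragment, not a truncated end.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one code point needs in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le variant: no BOM is added
        "utf-32": len(ch.encode("utf-32-le")),  # -le variant: no BOM is added
    }


def resync_utf8(data: bytes) -> bytes:
    """Skip leading continuation bytes (bit pattern 10xxxxxx).

    Hypothetical helper: after the skip, the fragment starts on a
    character boundary again.
    """
    start = 0
    while start < len(data) and (data[start] & 0xC0) == 0x80:
        start += 1
    return data[start:]


# One sample code point from each of the four ranges in the table.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]
for ch in samples:
    print(f"${ord(ch):06X}", encoded_sizes(ch))

# Cut a UTF-8 stream in the middle of the 3-byte euro sign.
stream = "a\u20acb".encode("utf-8")  # bytes: 61 E2 82 AC 62
tail = stream[2:]                    # bytes: 82 AC 62, starts mid-character
print(resync_utf8(tail).decode("utf-8"))
```

The resync works because a UTF-8 continuation byte can never be mistaken for the start of a character, which is advantage 3 in practice.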
I still believe that the best way to handle ansistring<-->widestring conversion is to use a fallback conversion (either 7-bit ASCII or ISO-8859-1) by default, and then provide units that override the conversion with versions based on the local charset of the environment or a charset specified by the application coder. Unfortunately, as I have said, whilst there is an interface in place for overriding the conversion, it is currently only usable where the local codeset is single byte rather than mixed width.

_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel