robert dockins wrote:

> > I don't pretend to fully understand various unicode standard but it
> > seems to me that these problems are deeper than file path library. The
> > equation (decode . encode) /= id seems confusing for me. Can you give
> > me an example when this happen?
>
> I am pretty sure that ISO 2022 encoded strings can have multiple ways to
> express the same unicode glyphs. This means that any sensible relation
> between ISO 2022 strings and unicode strings maps more than one ISO 2022
> string onto the same unicode string. The inverse is therefore not a
> function. To make it a function one of the possibly several encodings
> of the unicode string will have to be chosen. So you have an ISO 2022
> string A which is decoded to a unicode string U. We reencode U to an
> ISO 2022 string B. It may be that A /= B. That is the problem.

Exactly. And it isn't a theoretical issue. E.g. in an environment
where EUC-JP is used, filenames may begin with <ESC>$)B (designate
JISX0208 to G1), or they may not (because G1 is assumed to contain
JISX0208 initially).

More generally, ISO-2022 strings frequently contain redundant
character-set switching sequences, so conversion to unicode and back
again typically won't yield the original sequence of bytes.

> The various UTF encodings do not have this particular problem; if a UTF
> string is valid, then it is a unique representation of a unicode string.

Except that there are some ad-hoc extensions, e.g. the UTF-8 variant
used by both Java and Tcl permits NUL characters to be embedded in
NUL-terminated UTF-8 strings by encoding them as the two-byte sequence
0xC0 0x80 (an overlong form, which is invalid in UTF-8 proper).

-- 
Glynn Clements <[EMAIL PROTECTED]>
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
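
For anyone who wants to see the A -> U -> B round trip concretely, here is a minimal sketch. It rests on some assumptions: it uses Python and its built-in "iso2022_jp" codec rather than EUC-JP or Haskell, and it uses a redundant <ESC>(B (designate ASCII to G0, which is already the initial state) as a stand-in for the redundant G1 designation mentioned above. The variable names are purely illustrative. The second half checks that a strict UTF-8 decoder refuses the two-byte NUL form.

```python
# Round trip through Python's built-in "iso2022_jp" codec.
# ESC ( B designates ASCII to G0. ASCII is already the initial state,
# so the escape sequence is redundant but still legal input.
original = b"\x1b(Babc"

text = original.decode("iso2022_jp")    # A -> U (bytes to Unicode)
reencoded = text.encode("iso2022_jp")   # U -> B (Unicode back to bytes)

print(repr(text))
print(repr(reencoded))
print("round trip preserved bytes:", reencoded == original)

# Strict UTF-8 has exactly one encoding of U+0000: the single byte 0x00.
strict_nul = "\x00".encode("utf-8")

# The two-byte form 0xC0 0x80 is an overlong sequence, so a strict
# UTF-8 decoder is expected to reject it.
try:
    b"\xc0\x80".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False

print("overlong NUL accepted:", overlong_accepted)
```

The point for a file path library is that if the filename on disk is the byte string A, then handing B back to the operating system may name a different file, or none at all.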