Steven D'Aprano wrote:

> Marko Rauhamaa wrote:
>
>> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:
>>
>>> Marko Rauhamaa wrote:
>>>
>>>> That said, UTF-8 does suffer badly from its not being
>>>> a bijective mapping.
>>>
>>> Can you explain?
>>
>> In Python terms, there are bytes objects b that don't satisfy:
>>
>>    b.decode('utf-8').encode('utf-8') == b
>
> Are you talking about the fact that not all byte streams are valid UTF-8?
> That is, some byte objects b may raise an exception on b.decode('utf-8').
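
For concreteness, here is a minimal sketch of the round trip being discussed. The helper name roundtrips and the sample byte strings are made up for illustration; only bytes.decode and str.encode come from the quoted text.

```python
def roundtrips(b: bytes) -> bool:
    """Return True if b survives decode-then-encode as UTF-8 unchanged."""
    try:
        return b.decode('utf-8').encode('utf-8') == b
    except UnicodeDecodeError:
        # b is not valid UTF-8, so there is no string to re-encode.
        return False


valid = '\u00e9'.encode('utf-8')   # b'\xc3\xa9', a well-formed sequence
invalid = b'\xff\xfe'              # the byte 0xFF never occurs in valid UTF-8

print(roundtrips(valid))    # True
print(roundtrips(invalid))  # False

# Decoding leniently avoids the exception, but the original bytes are lost:
# each undecodable byte becomes U+FFFD REPLACEMENT CHARACTER.
lossy = invalid.decode('utf-8', errors='replace').encode('utf-8')
print(lossy == invalid)     # False
```

So there are two ways to fail the equality: the decode raises, or a lenient decode succeeds and quietly changes the bytes.
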
Eh, I should have read the rest of the thread before replying...

> I don't see why that means UTF-8 "suffers badly" from this. Can you give
> an example of where you would expect to take an arbitrary byte-stream,
> decode it as UTF-8, and expect the results to be meaningful?

File names on Unix-like systems.

Unfortunately file names are a bit of a mess, but we're slowly converging on
Unicode support for files. I reckon that by 2070, 2080 tops, we'll have that
licked...

The three major operating systems have different levels of support for
Unicode file names:

* Apple OS X: HFS+ stores file names in decomposed form, using UTF-16. I
  think this is the strictest Unicode support of all common file systems.
  Well done Apple.

  Decomposed in this sense means that single code points may be expanded
  where possible, e.g. é U+00E9 LATIN SMALL LETTER E WITH ACUTE will be
  stored as two code points, U+0065 LATIN SMALL LETTER E + U+0301 COMBINING
  ACUTE ACCENT.

* Windows: NTFS stores file names as sequences of 16-bit code units except
  0x0000. (Additional restrictions also apply: e.g. in POSIX mode, / is
  also forbidden; in Win32 mode, / ? + etc. are forbidden.) The code units
  are interpreted as UTF-16 but the file system doesn't prevent you from
  creating file names with invalid sequences.

* Linux: ext2/ext3 stores file names as arbitrary bytes except for / and
  nul. However most Linux distributions treat file names as if they were
  UTF-8 (displaying ? glyphs for undecodable bytes), and many Linux GUI
  file managers enforce the rule that file names are valid UTF-8.

File systems on removable media (FAT32, UDF, ISO-9660 with or without
extensions such as Joliet and Rock Ridge) have their own issues, but
generally speaking don't support Unicode well or at all.

So although the current situation is still a bit of a mess, there is a slow
move towards file names which are valid Unicode.


--
Steven

--
https://mail.python.org/mailman/listinfo/python-list
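
P.S. The decomposition described for HFS+ above can be tried out from Python with the standard unicodedata module. This is only a sketch: it uses standard NFD as a stand-in, on the assumption that it matches what the file system does for this character, and HFS+'s exact rules may differ in details. The variable names are made up for illustration.

```python
import unicodedata

composed = '\u00e9'   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = unicodedata.normalize('NFD', composed)

# List the code points of the decomposed form.
codepoints = [f'U+{ord(c):04X}' for c in decomposed]
print(codepoints)                # ['U+0065', 'U+0301']

# The two strings display the same but compare unequal...
print(composed == decomposed)    # False

# ...until both are brought to the same normalization form.
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
```

That inequality is why a file name typed in composed form may not compare equal to the name a decomposing file system hands back, unless both sides are normalized first.
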