On Wed, 18 Jan 2023 at 16:30:46 -0700, Anthony Fok wrote:
> In their mind, GB 18030 encompasses a lot more than just
> a character encoding mapping table. It is the full support package
> (including fonts, display, printing, input methods, etc.) for Han
> Chinese and all other minority languages used in China.

If I'm reading correctly, the character encoding part of GB 18030-2022 is a subset of a sufficiently new version of Unicode, in the same way that (say) ISO-8859-15 is a subset of Unicode: for every character representable in GB 18030-2022, you can point at an equivalent Unicode character and say "this is the GB 18030-2022 encoding of U+4E00" or similar. Is that true?

If that's the case, then supporting text files written in GB 18030 does not *necessarily* require the internal representation or the system locale to be GB 18030, in the same way that I can still work with legacy en_GB.ISO-8859-15 files on my en_GB.UTF-8 system. It could equally well be done by using iconv() or equivalent to transcode to UTF-8, UTF-16 or UCS-4 on input, doing all text editing operations on that Unicode, and then transcoding back into GB 18030 on output. Most language frameworks already do this as a matter of API: Qt, Java and Windows tend to work with UTF-16 internally, while GLib/GTK uses UTF-8 internally.

iconv() seems very unlikely to drop support for GB 18030, ISO-8859-15 and other non-Unicode encodings altogether. What this bug report is about is dropping support for locales whose associated encoding is non-Unicode, such as en_GB.ISO-8859-15 and zh_CN.GB18030, so that the data stream between a CLI program and the terminal emulator will be assumed to be UTF-8 instead of ISO-8859-15 or GB18030.

The main problem I can see for GB 18030 users if the zh_CN.GB18030 locale were dropped is that various programs might assume that the locale encoding is the right one to use when loading existing files whose encoding they are unable to guess, or the right one to write into new files by default. Users who have moved from zh_CN.GB18030 to zh_CN.UTF-8 might therefore find themselves unintentionally producing new UTF-8 files.

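To illustrate the "transcode on input, edit as Unicode, transcode back on output" approach, here is a minimal sketch in Python. It assumes Python's built-in "gb18030" codec, which may not track the mapping changes in the 2022 revision of the standard. The function name edit_gb18030 and the sample text are made up for the example.

```python
def edit_gb18030(data: bytes, edit) -> bytes:
    """Apply a Unicode-level edit to GB 18030 encoded bytes.

    `edit` is any callable taking and returning a str; the name and
    signature are hypothetical, chosen only for this illustration.
    """
    # Transcode to Unicode on input. Invalid byte sequences raise
    # UnicodeDecodeError instead of being silently replaced.
    text = data.decode("gb18030")

    # All text processing happens on the Unicode string, independent
    # of the current locale's encoding.
    edited = edit(text)

    # Transcode back to GB 18030 on output.
    return edited.encode("gb18030")


# Sample input: U+4E00, and U+20000 (which needs four bytes in GB 18030).
original = "\u4e00\U00020000".encode("gb18030")
result = edit_gb18030(original, lambda s: s + "\u4e8c")
```

Nothing here depends on the process running in a zh_CN.GB18030 locale: the encoding is named explicitly at the two I/O boundaries, which is the same division of labour as calling iconv() in C.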
Preferring to use Unicode does seem to be the direction that all of computing is going in, as a simplifying assumption - for example, W3C advice for HTML is "You should always use the UTF-8 character encoding"[1] - and as we know, things that aren't tested usually don't work. So I think the level of functionality for non-UTF-8 locales and encodings in the software we package is going to decline over time, whether Debian wants it to or not.

    smcv

[1] https://www.w3.org/International/questions/qa-html-encoding-declarations