Re: POSIX character classes
Dixi quod…

>>be conforming for character classes to match ASCII only. You either
>>support UTF-8 or you don't.
>
>For POSIX purposes, we really don’t, as we use our own routines […]
>The question was more whether [[:upper:]] matching [A-Z] would
>be more useful than not matching anything at all.

It will now be done that way: [[:upper:]] will match exactly
the same as [A-Z] (and only those 26 chars, even on EBCDIC).
Globbing is on octet level, not on (multibyte) character level,
although I reserve to change that for when utf8-mode is enabled
later, and for lksh (which may, or may not, eventually use the
underlying OS-provided locale functions).

Furthermore, “set -o posix” and “set -o sh” will now turn off
utf8-mode in addition to braceexpand.

This is not only better than nothing, it also fixes installation
of the Debian postfix package since its maintainer is too…
unexperienced… to realise that writing [A-Z] is not only shorter
than using the character classes but also less prone to
locale-dependent surprises.

Thus, expect an mksh R56 coming up Really Soon Now™.

bye,
//mirabilos
-- 
22:20⎜ The crazy that persists in his craziness becomes a master
22:21⎜ And the distance between the craziness and geniality is
       only measured by the success
18:35⎜ "Psychotics are consistently inconsistent. The essence of
       sanity is to be inconsistently inconsistent
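As a rough illustration of the behaviour announced above, the following sketch compares [[:upper:]] with [A-Z] for a few single characters. It assumes a POSIX-style shell that supports character classes in case patterns and pins the C locale; the function name classify is made up for this example and is not part of mksh.

```shell
# Pin the C locale so that [A-Z] is not subject to locale-dependent collation.
LC_ALL=C
export LC_ALL

# classify CHAR (hypothetical helper): report whether CHAR matches
# the character class and whether it matches the explicit range.
classify() {
    by_class=no
    by_range=no
    case $1 in
    ([[:upper:]]) by_class=yes ;;
    esac
    case $1 in
    ([A-Z]) by_range=yes ;;
    esac
    printf '%s class=%s range=%s\n' "$1" "$by_class" "$by_range"
}

# In a shell where the two patterns are equivalent, both columns agree on every line.
for c in A Z a 5 _; do
    classify "$c"
done
```

On a shell without character class support, the class column would stay at "no" for every input, which is the situation the change is meant to end.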
Re: POSIX character classes
Martijn Dekker dixit:

>> Can I get by making them match ASCII only even in UTF-8 mode?
>
>IMHO, that would defeat their primary purpose, namely locale-dependent
>class matching, so no, not really. :)
>
>If Greeks or Russians (or Germans, for that matter) can't count on
>[:upper:] matching an upper case letter in their alphabets, then I'd say

There’s no alphabets in UTF-8, only global Unicode.

>> Strictly speaking, POSIX requires only support for the C locale,
>[...]
>
>Yes, but on systems supporting other locales (e.g. UTF-8), it would not
>be conforming for character classes to match ASCII only. You either
>support UTF-8 or you don't.

For POSIX purposes, we really don’t, as we use our own routines
to read and write multibyte characters and handle them as wide
characters internally. We _really_ cannot use POSIX locales in
mksh at all. So if a system has 32-bit wchar_t and supports the
Unicode astral planes, mksh isn’t conforming in UTF-8 mode there
either. (POSIX does, however, not demand UTF-8 or Unicode support
at all, only the C locale, so that’s okay.)

The question was more whether [[:upper:]] matching [A-Z] would
be more useful than not matching anything at all.

bye,
//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
  -- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2
Re: POSIX character classes (was Re: pipes and sub-shells)
On 23-03-17 at 22:02, Thorsten Glaser wrote:
> Martijn Dekker dixit:
>
>> * BUG_NOCHCLASS: POSIX-mandated character [:classes:] within bracket
>> [expressions] are not supported in glob patterns.
>
> I really really REALLY hate that this will make mksh really big.
> We’re talking about 36K .rodata even without titlecase conversion
> and BMP-only (16-bit Unicode) here.

I sympathise. Even fnmatch(3) is not compliant on all systems; the BSDs
don't seem to have caught up yet. :( I don't suppose using that is an
option in any case because of mksh's extended globbing functionality.

Is adding 36k really that much in 2017? On my system, the current
development binary of mksh is 283k after stripping when built with -O2,
235k with -Os. Adding 36k would make it 319k/271k, still quite small.

If that's too much, I guess you should continue to not support them. The
reason modernish detects BUG_NOCHCLASS is not to make some sort of
statement, but to enable programs using the library to easily check for
the presence of the issue and implement alternative methods (such as
falling back to external commands, or just matching ASCII only without
character classes).

> Can I get by making them match ASCII only even in UTF-8 mode?

IMHO, that would defeat their primary purpose, namely locale-dependent
class matching, so no, not really. :)

If Greeks or Russians (or Germans, for that matter) can't count on
[:upper:] matching an upper case letter in their alphabets, then I'd say
for them it would be better to have no support than broken support.

> Strictly speaking, POSIX requires only support for the C locale,
[...]

Yes, but on systems supporting other locales (e.g. UTF-8), it would not
be conforming for character classes to match ASCII only. You either
support UTF-8 or you don't.

- M.
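The fallback strategy described in this message (check for the issue, then match ASCII only without character classes) could look roughly like the sketch below. The names has_chclass and is_upper are hypothetical and are not modernish's actual API or test; the probe relies on the fact that a shell lacking class support does not treat [[:upper:]] as a pattern matching the single character A.

```shell
# has_chclass (hypothetical): succeed if glob patterns honour [:classes:]
# inside bracket expressions, fail otherwise.
has_chclass() {
    case A in
    ([[:upper:]]) return 0 ;;
    esac
    return 1
}

# is_upper CHAR (hypothetical): succeed if CHAR is a single upper-case letter.
# With class support the shell decides; without it, fall back to an
# explicit ASCII list, which avoids locale-dependent range collation.
is_upper() {
    if has_chclass; then
        case $1 in
        ([[:upper:]]) return 0 ;;
        esac
    else
        case $1 in
        ([ABCDEFGHIJKLMNOPQRSTUVWXYZ]) return 0 ;;
        esac
    fi
    return 1
}

# Example use: report on a few sample characters.
for c in Q q 7; do
    if is_upper "$c"; then
        printf '%s: upper case\n' "$c"
    else
        printf '%s: not upper case\n' "$c"
    fi
done
```

The explicit letter list in the fallback branch is deliberately verbose: it matches the same 26 characters regardless of how the shell or locale orders a range such as A-Z.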
POSIX character classes (was Re: pipes and sub-shells)
Martijn Dekker dixit:

>* BUG_NOCHCLASS: POSIX-mandated character [:classes:] within bracket
>[expressions] are not supported in glob patterns.

I really really REALLY hate that this will make mksh really big.
We’re talking about 36K .rodata even without titlecase conversion
and BMP-only (16-bit Unicode) here.

Can I get by making them match ASCII only even in UTF-8 mode?

Strictly speaking, POSIX requires only support for the C locale,
and our UTF-8 mode is only close to POSIX anyway, and currently
(though this will change, there have been good points made for
locale tracking) enabled using a mksh-specific set flag.

If I implemented that, we could then say that “lksh -o posix”
is, in the C locale, fully POSIX conformant. (Perhaps, but
certainly a goal to work for, even despite the uselessness
of standards.)

bye,
//mirabilos
-- 
(gnutls can also be used, but if you are compiling lynx for your own use,
there is no reason to consider using that package)
	-- Thomas E. Dickey on the Lynx mailing list, about OpenSSL