Re: POSIX character classes (was Re: pipes and sub-shells)

2017-03-24 Thread Martijn Dekker
On 23-03-17 at 22:02, Thorsten Glaser wrote:
> Martijn Dekker dixit:
> 
>> * BUG_NOCHCLASS: POSIX-mandated character [:classes:] within bracket
>> [expressions] are not supported in glob patterns.
> 
> I really really REALLY hate that this will make mksh really big.
> We’re talking about 36K .rodata even without titlecase conversion
> and BMP-only (16-bit Unicode) here.

I sympathise.

Even fnmatch(3) is not compliant on all systems; the BSDs don't seem to
have caught up yet. :(  I don't suppose using that is an option in any
case because of mksh's extended globbing functionality.

Is adding 36k really that much in 2017? On my system, the current
development binary of mksh is 283k after stripping when built with -O2,
235k with -Os. Adding 36k would make it 319k/271k, still quite small.

If that's too much, I guess you should continue to not support them. The
reason modernish detects BUG_NOCHCLASS is not to make some sort of
statement, but to enable programs using the library to easily check for
the presence of the issue and implement alternative methods (such as
falling back to external commands, or just matching ASCII only without
character classes).
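
For illustration, a script could do such a check and fallback by hand
roughly as follows. This is only a minimal sketch in plain POSIX sh, not
modernish's actual detection code; the function names has_chclass and
is_upper are made up for this example.

```shell
#!/bin/sh
# Probe: does this shell honour [:classes:] inside bracket expressions
# in glob (case) patterns? Without support, '[[:upper:]]' is read as the
# bracket expression '[[:upper:]' followed by a literal ']', so the
# single character 'A' does not match.
has_chclass() {
    case A in
    ( [[:upper:]] ) return 0 ;;
    esac
    return 1
}

# is_upper STRING (hypothetical helper): succeed if STRING is exactly
# one upper case letter. Uses the character class where available and
# otherwise falls back to an explicit ASCII-only list.
is_upper() {
    if has_chclass; then
        case $1 in
        ( [[:upper:]] ) return 0 ;;
        esac
    else
        case $1 in
        ( [ABCDEFGHIJKLMNOPQRSTUVWXYZ] ) return 0 ;;
        esac
    fi
    return 1
}
```

Usage would be something like "if is_upper "$c"; then ...; fi". Note that
the fallback branch only knows about ASCII letters, which is exactly the
limitation discussed further down.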

> Can I get by making them match ASCII only even in UTF-8 mode?

IMHO, that would defeat their primary purpose, namely locale-dependent
class matching, so no, not really. :)

If Greeks or Russians (or Germans, for that matter) can't count on
[:upper:] matching an upper case letter in their alphabets, then I'd say
for them it would be better to have no support than broken support.
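
To see what a given shell does here, one could run a quick probe like
the one below. It is just a sketch: the helper names are invented for
this example, the locale name en_US.UTF-8 is an assumption (it may not
be installed on your system), and the result for the UTF-8 locale
depends on the shell and the C library.

```shell
#!/bin/sh
# A-umlaut (U+00C4) encoded as UTF-8, written with octal escapes so that
# the script itself stays pure ASCII.
A_UML=$(printf '\303\204')

# matches_upper STRING (hypothetical helper): succeed if STRING matches
# the glob pattern [[:upper:]] in the current locale.
matches_upper() {
    case $1 in
    ( [[:upper:]] ) return 0 ;;
    esac
    return 1
}

# probe_upper LOCALE (hypothetical helper): report whether A-umlaut
# matches [[:upper:]] with LC_ALL set to LOCALE. The subshell keeps the
# locale change from leaking into the rest of the script.
probe_upper() {
    if ( LC_ALL=$1; export LC_ALL; matches_upper "$A_UML" ); then
        echo "$1: matches"
    else
        echo "$1: no match"
    fi
}

probe_upper C
probe_upper en_US.UTF-8   # assumption: this locale exists on the system
```

In the C locale the two UTF-8 bytes are not a single upper case
character, so no match is expected there. A shell with locale-aware
classes should report a match in the UTF-8 locale; one that matches
ASCII only will not.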

> Strictly speaking, POSIX requires only support for the C locale,
[...]

Yes, but on systems supporting other locales (e.g. UTF-8), it would not
be conforming for character classes to match ASCII only. You either
support UTF-8 or you don't.

- M.



POSIX character classes (was Re: pipes and sub-shells)

2017-03-23 Thread Thorsten Glaser
Martijn Dekker dixit:

>* BUG_NOCHCLASS: POSIX-mandated character [:classes:] within bracket
>[expressions] are not supported in glob patterns.

I really really REALLY hate that this will make mksh really big.
We’re talking about 36K .rodata even without titlecase conversion
and BMP-only (16-bit Unicode) here.

Can I get by making them match ASCII only even in UTF-8 mode?

Strictly speaking, POSIX requires only support for the C locale,
and our UTF-8 mode is only close to POSIX anyway, and currently
(though this will change, there have been good points made for
locale tracking) enabled using a mksh-specific set flag.

If I implemented that, we could then say that “lksh -o posix”
is, in the C locale, fully POSIX conformant. (Perhaps — but
certainly a goal to work for, even despite the uselessness
of standards.)

bye,
//mirabilos
-- 
(gnutls can also be used, but if you are compiling lynx for your own use,
there is no reason to consider using that package)
-- Thomas E. Dickey on the Lynx mailing list, about OpenSSL