--- Jonadab the Unsightly One <[EMAIL PROTECTED]> wrote:
> 
> Have the implications of the bytes/codepoints/graphemes/woohickies
> distinction for the regular expression engine been discussed already?

Not enough.

One of my current clients just rolled on to Red Hat 9, and what a
steaming pile of digestive byproducts *that* turned out to be.
Apparently the default locale setting changed, so now LC_ALL="" out of
the box.

One effect of this is an irritating lack of proper behavior in the
utilities. But when you switch to LC_ALL=<pick your favorite
language>, you just get really slow performance: apparently the 'C'
locale is such a totally special case that the performance of LC_ALL=C
is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even
when the data is 7-bit ASCII.

I think that (1) this is unacceptable: the temptation to switch to the
'C' locale has been too great, both at this site and on a lot of the RH
support forums; (2) Perl 6 should equitably support all its target
locales; (3) we should set out to make sure the performance is damn
fast no matter what locale we're using.

This has no direct bearing on p6l, since performance is a p6i issue.
But perhaps in the interests of performance as well as hackery we
should explicitly provide some sort of variant regex behavior:

    /a./ :bytes
    /a./ :graphemes

where the first would recognize 0x61 followed by any single byte, while
the second would recognize 'a' followed by any number of bytes
composing a single grapheme.
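
To make the distinction concrete, here is a rough sketch in Python
rather than Perl 6, since the :bytes and :graphemes modifiers above are
only a proposal. The function names are hypothetical, and the grapheme
matcher is a deliberate approximation: it treats "one base character
plus any trailing combining marks" as a grapheme, which is simpler than
full Unicode grapheme cluster segmentation.

```python
import re
import unicodedata


def match_bytes(data: bytes):
    """Byte view of /a./: 0x61 followed by exactly one byte."""
    m = re.match(rb"a.", data, re.DOTALL)
    return m.group() if m else None


def match_grapheme(text: str):
    """Approximate grapheme view of /a./.

    Assumption: a grapheme is one base character followed by zero or
    more combining marks. Real grapheme segmentation has more rules.
    """
    if len(text) < 2 or text[0] != "a":
        return None
    end = 2
    # Absorb combining marks that attach to the character after 'a'.
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    return text[:end]


if __name__ == "__main__":
    text = "ae\u0301z"  # 'a', 'e' + COMBINING ACUTE ACCENT, 'z'
    print(match_bytes(text.encode("utf-8")))  # b'ae'
    print(len(match_grapheme(text).encode("utf-8")))  # 4
```

The byte-level match stops after 'e' and leaves the accent behind,
while the grapheme-level match takes all three bytes of the accented
letter. With a precomposed character the byte view is worse still: it
ends in the middle of a UTF-8 sequence.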

(I'll claim that it's legitimate to want to search for, say, any
multi-byte characters (MBCs) introduced via \x0F\x01, regardless of
length. This is likely not supported any other way.)

=Austin
