--- Jonadab the Unsightly One <[EMAIL PROTECTED]> wrote:
>
> Have the implications of the bytes/codepoints/graphemes/woohickies
> distinction for the regular expression engine been discussed already?
Not enough.

One of my current clients just rolled on to Red Hat 9, and what a steaming pile of digestive byproducts *that* turned out to be. Apparently the default locale setting changed, so now LC_ALL="" out of the box. One effect of this is an irritating lack of proper behavior in the utilities. But when you switch to LC_ALL=<pick your favorite language>, you just get really slow performance: apparently the 'C' locale is such a totally special case that the performance of LC_ALL=C is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even when the data is 7-bit ASCII.

I think that (1) this is unacceptable: the temptation to switch to the 'C' locale has been too great, both at this site and on a lot of the RH support forums; (2) Perl6 should equitably support all its target locales; (3) we should set out to make sure the performance is damn fast no matter what locale we're using.

This has no direct bearing on p6l, since performance is a p6i issue. But perhaps in the interests of performance as well as hackery we should explicitly provide some sort of variant regex behavior:

    /a./ :bytes
    /a./ :graphemes

where the first would recognize 0x61 followed by any single byte, while the second would recognize 'a' followed by any number of bytes composing a single grapheme.

(I'll claim that it's legitimate to want to search for, say, any MBCs introduced via \x0F\x01, regardless of length. This is likely not supported any other way.)

=Austin
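
To make the difference between the two proposed behaviors concrete, here is a rough sketch in Python rather than Perl6, since the :bytes and :graphemes modifiers above are only a proposal. The function names match_bytes and match_grapheme are made up for illustration, and the grapheme rule used here (one base code point plus any trailing combining marks) is a deliberate simplification, not full Unicode grapheme cluster segmentation.

```python
import re
import unicodedata


def match_bytes(data: bytes):
    """Byte semantics: 0x61 followed by any single byte."""
    m = re.search(rb"a.", data, re.DOTALL)
    return m.group(0) if m else None


def match_grapheme(text: str):
    """Grapheme semantics (approximate): 'a' followed by one grapheme.

    Assumption: a grapheme is one base code point plus any trailing
    combining marks. Real segmentation has more rules than this.
    """
    for i, ch in enumerate(text):
        if ch != "a":
            continue
        j = i + 1
        # No following character, or this 'a' carries a combining mark
        # itself and so is not a bare 'a' grapheme: keep looking.
        if j >= len(text) or unicodedata.combining(text[j]):
            continue
        k = j + 1
        while k < len(text) and unicodedata.combining(text[k]):
            k += 1
        return text[i:k]
    return None


if __name__ == "__main__":
    s = "a\u00e9"  # 'a' followed by precomposed e-acute
    print(match_bytes(s.encode("utf-8")))  # b'a\xc3': splits the 2-byte char
    print(match_grapheme(s) == s)          # the whole e-acute is consumed
```

The byte-level match stops in the middle of the UTF-8 sequence for the accented character, which is exactly what you want when hunting for raw multibyte introducers and exactly what you do not want when matching text.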