This is a long reply, but I read it over a few times, and I don't see any fat to trim. This isn't really a simple issue for which intuition is going to be a sufficient guide, though I agree fully that it needs to be high on or at the top of the list.
On Sun, Jul 18, 2010 at 6:26 AM, Moritz Lenz <mor...@faui2k3.org> wrote:

> In general, stuffing more complex behaviour into something that feels
> unintuitive is rarely (if ever) a good solution.

Walk with me a bit, and let's explore the concept of intuitive character ranges. This was my suggestion, which seems pretty basic to me:

"x .. y", for all strings x and y that are each composed of a single, valid codepoint which is neither combining nor modifying, yields the range of all valid, non-combining/modifying codepoints between x and y, inclusive, which share the Unicode script, general category major property and general category minor property of either x or y (lack of a minor property is a valid value).

In general we have four problems with the current specification and implementation on the Perl 6 and Perl 5 sides:

1) Perl 5 and Rakudo have a fundamental difference of opinion about what some ranges produce ("A" .. "z", "X" .. "T", etc.) and yet we've never really articulated why we want that.

2) We deny that a range whose LHS is "larger" than its RHS makes sense, but we also don't provide an easy way to construct such ranges lazily otherwise. This would be merely annoying, but we have also declared that ranges are the right way to construct basic loops (e.g. for (1..1e10).reverse -> $i {...} which is not lazy (blows up your machine) and feels awfully clunky next to for 1e10..1 -> $i {...} which would not blow up your machine, or even make it break a sweat, if it worked).

3) We've never had a clear-cut goal in allowing string ranges (as opposed to character ranges, which Perl 5 and 6 both muddy a bit), so "intuitive" becomes sketchy at best past the first grapheme, and even muddier when only considering codepoints (thus that wing of my proposal and current behavior are on much shakier ground, except in so far as it asserts that we might want to think about it more).

4) Many ranges involving single characters on LHS and RHS result in null or infinite output, which is deeply non-intuitive to me and, I expect, to many others.

Solve those (and I tried in my suggestion) and I think you will be able to apply intuition to character ranges, but only in so far as a human being is likely to be able to intuit anything related to Unicode.

> The current behaviour of the range operator is (if I recall correctly):
>
> 1) if both sides are single characters, make a range by incrementing
> codepoints

Sadly, you can't do that reasonably. Here are some examples of why, using only Latin and Greek as examples (not the most convoluted Unicode sections to be sure):

- "Α" (capital Greek alpha, not Latin A) .. "Ω" would result in a range that contains an invalid codepoint (rakudo: drops the invalid codepoint, which you may have meant to imply, but I'm being pedantic because I want to come to a specification, not just a sense of the right solution)

- "Ā" .. "Ē" would be "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒ" which is really not what you're likely to expect! (rakudo: Ā, infinitely repeating, which is an even larger problem for Katakana, where "オ" .. "ヺ" seems a very intuitive way to say "all Katakana non-cased letters" but fails because the range contains both cased and uncased; Perl 5 just prints "オ", and I think it also sneers at you)

- "A" .. "z" comes out really odd because it contains punctuation (mind you, your suggestion is saner than Rakudo's current behavior on "A" .. "z", which is an infinite progression of capital-letter-only sequences of 1 or more characters! Intuitive, it's not.)

My point was that, if you want simple and intuitive out of Unicode, you're kind of screwed. The closest you can get is to build your range using properties and script. The way I suggested doing that was the simplest I could think of. Speak up if you have a simpler one. For most simple ranges, our results will be identical (e.g. "A" .. "Q").

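To make the rule concrete, here is a rough sketch of it in Python (not Perl 6, only because it is easy to run today). It is an illustration under assumptions, not a specification: Python's unicodedata module has no Script property, so the first word of the character name stands in for the script, which happens to work for the Latin and Greek cases discussed here but is not correct in general. The names props and char_range are made up for this sketch, and the check that x and y are single non-combining codepoints is left out.

```python
import unicodedata


def props(ch):
    """Return (script stand-in, general category) for one character.

    ASSUMPTION: the first word of the Unicode character name is used as a
    crude substitute for the Script property, which the stdlib lacks.
    Unassigned codepoints have no name and get an empty "script".
    """
    name = unicodedata.name(ch, "")
    script = name.split(" ")[0] if name else ""
    return (script, unicodedata.category(ch))


def char_range(x, y):
    """Lazily yield the codepoints from x to y (inclusive) whose
    (script, category) pair matches that of either endpoint."""
    wanted = {props(x), props(y)}
    lo, hi = ord(x), ord(y)
    # Walk downwards when the left side is "larger" than the right side.
    step = 1 if lo <= hi else -1
    for cp in range(lo, hi + step, step):
        ch = chr(cp)
        if props(ch) in wanted:
            yield ch


if __name__ == "__main__":
    print("".join(char_range("A", "z")))
    print("".join(char_range("\u0100", "\u0112")))  # "Ā" .. "Ē"
    print("".join(char_range("\u0391", "\u03a9")))  # Greek "Α" .. "Ω"
```

Under these assumptions, "A" .. "z" comes out as the 52 ASCII letters with no punctuation, "Ā" .. "Ē" as only the capital letters, and the Greek range skips the unassigned codepoint. Because char_range is a generator, nothing is built up front, and a reversed pair such as "z" .. "a" simply walks downwards.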
For the above examples, I would end up producing:

1: Alpha through Omega Greek capital letters
2: ĀĂĄĆĈĊČĎĐĒ (and オカガキギクグケゲコゴサザシジスズセゼソゾタダチヂツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモヤユヨラリルレロワヰヱヲンヴヷヸヹヺ for the Katakana)
3: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

That seems pretty darned intuitive to me.

Mind you, "A" .. "ž" is still ugly as sin in terms of ordering once you listify, and I can't reasonably fix that without re-defining Unicode or having a really, really convoluted and special-case rule, but without getting convoluted, even that ugly example does something useful and, I dare say, intuitive for testing membership.

Here's the pseudo-code for my suggestion:

    class SingleCharAlphaRange {
        has $.start;
        has $.end;

        # Verify that this is a single character string which is valid
        # and non-combining/non-modifying and represented by
        # one and only one codepoint.
        method valid(Str $s --> Bool) {
            # Assert that this is valid Unicode, 1 codepoint string which is not a
            # combining or modifying codepoint....
        }

        # Is $s in this range?
        method in-range(Str $s --> Bool) {
            return fail() unless self.valid($s);                  # "abc" ~~ "a" .. "z"
            return True if self.start eq $s or self.end eq $s;   # "a" ~~ "a" .. "z"
            return False if $s.ord < self.start.ord;             # "a" ~~ "b" .. "z"
            return False if $s.ord > self.end.ord;               # "z" ~~ "a" .. "y"
            my @props-a = self.props_start;  # get script and properties for $.start
            my @props-b = self.props_end;    # ' ' $.end
            my @props-s = self.props($s);    # ' ' $s
            if @props-a ~~ @props-s or @props-b ~~ @props-s {
                return True;                                      # "b" ~~ "a" .. "z"
            }
            return False;  # between the endpoints, but shares neither one's properties
        }

        ...

        method list() {
            gather do {
                for self.start.ord .. self.end.ord -> $i {
                    take chr($i) if self.in-range(chr($i));
                }
            }
        }
    }

> 2) otherwise, call .succ on the LHS. Stop before the generated values
> exceed the RHS.

Isn't that what you do when you try to listify a range? A range doesn't do any of that unless you try to walk it. What happens when you ask if a value is in the range "AA" .. "zz"?
Do you iterate through every possible value and then return false if nothing matched?

> I'm not convinced it should be any more complicated than that.

If you have an idea that makes it simpler, I'm all ears. But I don't see anything that makes it simpler than my suggestion FOR THE USER. For us, it can be absurdly complex, but I would hope we could keep it simple for the user.

PS: Notice that 5..1 would have to be 5,4,3,2,1 for this proposal to really make sense, which I believe it needs to. After all, currently (1..1e10).reverse will blow up, and there's no really good way around that. It would be much simpler to just be able to say "1e10 .. 1", and I can't think of a reason not to allow it that doesn't boil down to "people expect that to fail because of Perl 5."

PPS: Other unexpected results in Rakudo, all of which seem to come from how Rakudo treats ranges it doesn't consider legitimate, namely by repeating the LHS infinitely:

"䷀" .. "䷿" - expected: all hexagram characters; got: first character, infinitely repeating.

"鐀" .. "鐅" - expected: all CJK Unified Ideographs between U+9400 and U+9405; got: first character, infinitely repeating.

"٠" .. "٩" - expected: all Arabic-Indic digits zero through nine; got: first digit (zero) repeating (note: bidi may confuse display in this email)

"א" .. "ת" - expected: all Hebrew letters; got: first character (א) repeating (note: bidi may confuse display in this email)

"Ａ" .. "Ｅ" - expected: all full width, capital letters A through E; got: full width A repeating.

--
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs